Distributed Computing for Data Scientists: Intro to Big Data, Hadoop & Spark

If you’ve spent any time in the software industry, you’ve probably run across the phrase “Big Data”. In this post, we’ll define this pervasive phrase and explain how to handle/analyze Big Data on a project.

How big is Big Data anyway? In general, Big Data means the data is too large to fit in memory on a single computer, so it’d be beneficial to switch to a distributed system of computers for processing. By this definition, many set the threshold for “big data” at anything over 32GB, since that’s roughly the upper end of RAM for a personal computer. However, the definition varies depending on who you ask.

There’s more to Big Data than size. It’s actually a field of study (according to Wikipedia) that covers how to analyze, store, query, transfer, and visualize large amounts of data, among other things.

Now that you can define Big Data, how would you run an analysis on data of this size? There are two different ways of scaling your computer architecture so that it may successfully analyze Big Data: vertical scaling and horizontal scaling.

Vertical scaling involves upgrading the parts of a single computer. This means you are buying more powerful parts (RAM, disk storage, processor) for a single machine. This may work for a large number of projects, but in other cases the size of the data is so large that it could take days for the computer to process your task. This is where the rubber meets the road and you need to consider horizontal scaling.

Horizontal scaling means buying/renting more computers and connecting them to work together. The beauty here is that you can add as many computers as needed.

You may hear a single computer referred to as a local system, and similarly, a connected system of computers set up to work in unison referred to as a distributed system (aka a system that’s horizontally scaled). As you probably guessed, distributed systems are used to more efficiently process Big Data.

We haven’t yet mentioned a key piece of a distributed system. As you can imagine, simply connecting many computers (called “nodes”) does not automatically give them the ability to distribute a dataset, run a processing task on that data, and then work together to combine their results. New software had to be written so that a distributed system could effectively and efficiently delegate a single task among its pooled resources. Hence, two wildly popular open-source Big Data projects were developed: Apache Hadoop (2006) and Apache Spark (2014).

Apache Hadoop and Apache Spark

Let’s gain an understanding of Apache Hadoop and Apache Spark. Both are frameworks for splitting up computational work across a distributed system. The main difference is that Hadoop’s computational component (called MapReduce) reads from and writes to disk, while Spark keeps as much data in RAM as possible. As a result, Spark is generally much faster than Hadoop’s MapReduce.

Apache Hadoop is composed of two technologies:

1) Hadoop Distributed File System (HDFS). This is the data storage component of Apache Hadoop. It splits up large files across multiple machines. It also provides fault tolerance by replicating data (3x by default) across different machines, so that if one machine crashes you’ll still have the data backed up. By default, each block of data handed to a machine is 128 MB.

2) MapReduce. This is the computational component of Apache Hadoop. It delegates tasks to each of the machines and then collects their outputs (the toy word-count sketch below walks through the Map, Shuffle, and Reduce phases).
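To make MapReduce a little more concrete, here is a toy, single-machine walk-through of the classic word-count example in Python. The input lines and variable names are invented for illustration; on a real cluster each phase runs in parallel across many nodes, with intermediate results written to disk between phases.

```python
# Toy, single-process sketch of the Map -> Shuffle -> Reduce phases that
# Hadoop MapReduce runs across many machines. Input lines are made up.
from collections import defaultdict

lines = ["big data is big", "spark and hadoop process big data"]

# Map phase: emit (key, value) pairs, here (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all values that share the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values into a single result.
reduced = {word: sum(counts) for word, counts in grouped.items()}

print(reduced)  # {'big': 3, 'data': 2, 'is': 1, 'spark': 1, ...}
```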

The way Spark increases the processing speed of an application is by keeping more data in memory. Hadoop’s MapReduce, on the flip side, reads and writes data to disk (to HDFS) for each operation it performs. How much faster is Spark than Hadoop MapReduce? Up to 100x faster for in-memory processing and around 10x faster for on-disk processing!

Spark works in-memory using what it calls a Resilient Distributed Dataset (RDD), an abstraction layer that allows Spark to accept a variety of data sources, distribute the data, remain resilient (by storing the lineage of the manipulated RDDs), and parallelize the work. Spark also ships with a number of libraries that are useful for data scientists, such as Spark MLlib (machine learning), Spark SQL, GraphX (graph processing), SparkR, and Spark Streaming (for streaming data). Spark supports APIs in Java, Scala, Python, and R.
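To give a feel for the RDD abstraction, here is a minimal PySpark sketch. It assumes pyspark is installed and runs in local mode; the numbers and lambdas are just for illustration. The transformations (map, filter) are lazy and only recorded as lineage; nothing runs until the collect() action is called.

```python
# Minimal RDD sketch: distribute a small dataset, transform it in parallel,
# and collect the result back to the driver.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")   # local mode for illustration

numbers = sc.parallelize(range(1, 11))        # distribute the data as an RDD
squares = numbers.map(lambda x: x * x)        # lazy transformation (recorded as lineage)
evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation

print(evens.collect())  # action: triggers the work -> [4, 16, 36, 64, 100]

sc.stop()
```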

As an aside, if you are serious about working with Spark, you will need to be familiar with how to call the Spark API in order to run your tasks. The primary API for Spark’s machine learning library is now the DataFrame-based API rather than the RDD-based API (under the hood, the DataFrame API still uses RDDs). The benefit of using the DataFrame-based API over the RDD-based API generally comes down to letting the developer specify what they want to compute instead of how to compute it. In other words, it’s a higher layer of abstraction that places the optimization of Spark code in the hands of the API itself as opposed to the developer. If you’re interested in learning more about the background of the Spark APIs, I recommend the video below from Databricks, a company founded by the original creators of Spark.
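To give a flavor of the DataFrame-based API, here is a minimal pyspark.ml sketch that fits a logistic regression. The tiny in-memory dataset and column names are invented for illustration; a real job would read a large DataFrame from distributed storage.

```python
# Sketch of Spark's DataFrame-based ML API (pyspark.ml).
# The toy dataset and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("dataframe-api-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.5, 0.3, 1.0), (2.0, 2.2, 1.0), (0.2, 0.9, 0.0)],
    ["x1", "x2", "label"],
)

# We declare *what* we want (assemble features, fit a model); Spark's
# optimizer decides *how* to execute it across the cluster.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("x1", "x2", "label", "prediction").show()

spark.stop()
```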

So now that you are at least conversational in Big Data, Hadoop, and Spark, it’s time to take the next step! If you are interested in learning how to use Spark with Python for data science, I recommend the class “Spark and Python for Big Data with PySpark” by Jose Portilla. Disclaimer: this endorsement doesn’t benefit me in any way; it’s just a Udemy class I found helpful. Thanks for taking the time to learn with us! Below you’ll find additional comparisons between Hadoop & Spark.

Additional Similarities:

-Both are free of charge; the only costs come from the infrastructure you run them on. Since Spark relies on lots of RAM (more expensive) rather than lots of disk space (less expensive), Spark will likely cost more to run than Hadoop. That isn’t always the case, though, since Spark can require significantly fewer machines than Hadoop.

-Both are open source and maintained by the Apache Software Foundation.

Additional Differences:

-Spark can perform operations up to 100x faster than Hadoop’s MapReduce by keeping as much data in memory as possible. Hadoop MapReduce works from disk instead.

-Spark significantly outperforms Hadoop MapReduce when a computation applies a function repeatedly to the same dataset (as iterative algorithms like gradient descent or k-means clustering do). This is because every MapReduce job must reload the data from disk: each job reads the data in, runs the Map/Shuffle/Reduce phases, and writes the results back to disk, which hinders performance on iterative algorithms. Spark removes this penalty by caching the data in memory (see the caching sketch after this list). As you’d expect, if the data is too large to fit in memory and Spark spills to disk, the penalty returns! See this StackOverflow discussion for an example of exactly that happening while the popular text analytics algorithm “Word2Vec” was being trained on a Spark cluster that did not have enough RAM to hold all of the data.

-Hadoop uses HDFS (the Hadoop Distributed File System) for its data storage, while Spark doesn’t come with a distributed filesystem out of the box and can sit on top of other storage systems. It should be noted that while HDFS is the traditional filesystem used with Hadoop, there are also ways to connect to other storage such as cloud object stores (Amazon S3, Azure Data Lake, etc.).

-Spark can perform both real-time (stream) processing and batch processing; Hadoop MapReduce can only perform batch processing. In batch processing, data is first collected and then processed at a later stage. In streamed/real-time processing, data is processed as soon as it is collected, yielding real-time insights. For example, imagine you want to gauge the sentiment of fans’ tweets as they happen at the Super Bowl. You could create an application that accepts tweets as they occur, extracts their meaning, and immediately updates a webpage showing the particular subjects people are tweeting about in real time (a small Spark Structured Streaming sketch appears after this list). This is in contrast to batch processing, which would instead collect all of the tweets from the Super Bowl after the fact and then perform some analytics task on this collected data.

-Spark was written in Scala while Hadoop was written in Java.
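To illustrate the caching point from the iterative-algorithms difference above, here is a toy PySpark sketch: the same cached DataFrame is scanned on every iteration of a made-up gradient-descent-style loop, so Spark serves it from memory instead of re-reading it from disk each time. The data and the update rule are invented for illustration.

```python
# Toy iterative job: cache() keeps the dataset in memory across iterations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

points = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["x"]).cache()

# Toy gradient descent toward the mean of x: every iteration scans the
# same cached data instead of re-reading it from disk.
center = 0.0
for _ in range(10):
    grad = points.agg(F.avg(F.col("x") - center)).first()[0]
    center += 0.5 * grad

print(center)  # approaches 2.5, the mean of x
spark.stop()
```

And to illustrate the streaming difference, here is the canonical Spark Structured Streaming word count, counting words as they arrive over a socket. The host and port are placeholders; in practice the source might be Kafka or a service collecting tweets.

```python
# Structured Streaming sketch: count words arriving over a socket in
# (near) real time. Feed it with e.g. `nc -lk 9999` for a quick test.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated word counts
         .format("console")
         .start())
query.awaitTermination()
```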