Firstly, it's important to note that Hadoop and Spark are broadly different technologies, with different use cases. The Apache Software Foundation, from which both technologies emerged, even places the two in different categories: Hadoop is a database, Spark is a big data tool.
In Apache's own words, Hadoop is a "distributed computing platform": "A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer."
In the large majority of cases, when someone talks about Hadoop they mean the Hadoop Distributed File System (HDFS), which is "a distributed file system that provides high-throughput access to application data". Then there is Hadoop YARN, a job scheduling and cluster resource management tool, and Hadoop MapReduce, for parallel processing of large data sets.
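Those three pieces fit together as storage (HDFS), scheduling (YARN) and computation (MapReduce). As a rough illustration of the programming model MapReduce implements - plain single-machine Python rather than Hadoop's actual Java API - a word count runs through a map phase, a shuffle, and a reduce phase:

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce programming model in plain Python.
# Real Hadoop MapReduce distributes these phases across a cluster; this
# only shows the map -> shuffle -> reduce flow on one machine.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "big data tools"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The point of the model is that the map and reduce functions know nothing about where the data lives, which is what lets Hadoop spread them across thousands of machines.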
Spark, on the other hand, is: "A fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics."
So how do they come together?
Both are big data frameworks. Basically, if you're a company with a fast-growing pool of data, Hadoop is open source software that will allow you to store this data in a reliable and secure way. Spark is a tool for understanding that data. If Hadoop is the Bible written in Russian, Spark is a Russian dictionary and phrasebook.
You can run Spark for your big data projects on HDFS, or on another data store: NoSQL databases such as MongoDB and Couchbase, and platforms like Teradata, all offer Spark connectors (see: vendors).
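As a rough, untested sketch of what that flexibility looks like in practice - assuming PySpark is installed, the HDFS path is a placeholder, and a vendor connector such as MongoDB's is configured on the cluster - the same Spark job can point at either source:

```python
# Hypothetical, untested sketch: the same Spark session can read from
# HDFS or, with a vendor connector configured, from a NoSQL store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-agnostic-job").getOrCreate()

# Reading from HDFS - the namenode address and path are placeholders.
hdfs_df = spark.read.json("hdfs://namenode:8020/data/events")

# Reading from MongoDB instead, assuming its Spark connector package
# and connection URI have been configured for the session.
mongo_df = spark.read.format("mongodb").load()
```

Once loaded, both DataFrames expose the same Spark API, which is why the storage layer underneath becomes an implementation detail.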
Will Gardella, director of product management at Couchbase, says: "When it comes to Hadoop you don't have a single technology, it is a giant family of technologies and at the bottom you have the distributed file network that everyone loves. HDFS solves unglamorous, difficult problems well and it lets you store as much stuff as you want and not worry about stuff getting corrupted. That part people rely on."
The choice really comes down to what you want to do with your data and the skill set of your IT staff. Once your data is in Hadoop there are lots of ways to extract value from it. You can go down the standard analytics route of plugging a tool into the data lake for data cleansing, querying and visualisation.
Big players in the analytics and business intelligence market like Splunk offer Hadoop-integrated products, and data-visualisation firms like Tableau will let you present this data back to non-data people.
Spark on the other hand, as Gardella says, is useful "if you want to give your staff access to genuine real-time data with the intention that they, or an algorithm, will make decisions off the back of that data."
If your data is simply a large amount of structured data, such as a database of medical records, then the streaming capabilities of Spark aren't strictly necessary.
Hadoop vs Spark: Pros and Cons
Reliability: One major benefit of Hadoop is that, as a distributed platform, it is less prone to failure, keeping the underlying data available at all times. This is why it is the chosen data store of many web companies, because the internet never sleeps.
Cost: Hadoop and Spark are both projects from the Apache Software Foundation, so they are free and open source. The cost comes from how you choose to implement them: the total cost of ownership, the time and resources implementation demands given the skills required, and the hardware. This model also makes a deployment highly scalable as your data lake grows.
The licensing model of traditional database providers like Oracle and SAP has long been the bane of many CIOs, so the Software-as-a-Service model provided by most of the Hadoop/Spark specialists gives greater flexibility while you figure out if the technology is useful.
Speed: Spark is reported to run up to 100 times faster than Hadoop MapReduce, according to the Apache Software Foundation. This is because Spark works in-memory rather than reading from and writing to hard drives. MapReduce will read data from the cluster, perform an operation and write the results back to the cluster, which takes time, whereas Spark keeps the intermediate results in memory.
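A toy sketch of that difference, using hypothetical plain-Python stages rather than real Hadoop or Spark APIs: the MapReduce-style function pays a serialisation, a disk write and a re-read between its two stages, while the Spark-style function simply chains them in memory.

```python
import json
import os
import tempfile

# Toy two-stage pipeline: square each number, then keep the even results.
# Not real Hadoop or Spark code - just the two execution styles compared.

data = list(range(10))

def mapreduce_style(numbers):
    """Each stage writes its output to storage; the next stage reads it back."""
    stage1 = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump([n * n for n in numbers], stage1)  # stage 1 result hits disk
    stage1.close()
    with open(stage1.name) as f:                 # stage 2 re-reads from disk
        squared = json.load(f)
    os.unlink(stage1.name)
    return [n for n in squared if n % 2 == 0]

def spark_style(numbers):
    """Transformations are chained; intermediate results never leave memory."""
    squared = (n * n for n in numbers)           # lazy, in-memory
    return [n for n in squared if n % 2 == 0]

# Both styles produce the same answer; only the I/O cost differs.
assert mapreduce_style(data) == spark_style(data)
print(spark_style(data))  # [0, 4, 16, 36, 64]
```

Multiply that disk round-trip by the dozens of stages in a real analytics job and the gap the Apache figures describe starts to make sense.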
Generality: Couchbase's Gardella says: "Spark can load data from every place: Couchbase, MySQL, Amazon S3, HDFS. All of the formats you expect load out of HBase. It makes Spark very versatile."
Skills: Whatever the vendors tell you, Spark is not an easy tool to use. It is intended for data analysts and experts and is generally applied to deeply complex and constantly changing streaming data sets.
Hadoop vs Spark: Use Cases
Due to its ability to store more and more data, some classic Hadoop use cases include a 360-degree view of your customers, recommendation engines for retailers and security and risk management.
Retailers and Internet of Things companies are also interested in Spark, however, because of its ability to conduct real-time interactive data analytics to deliver greater personalisation.
According to MongoDB's VP of strategy Kelly Stirman, Spark's growing popularity is due to its compatibility with one important use case: machine learning.
Stirman tells Computerworld UK: "Ten years into Hadoop and the hallmarks are still promising but most people have found it hard to use and not well suited to artificial intelligence and machine learning."
Gardella at Couchbase agrees: "I think the way Spark executes in machine learning is far better. It is clear that machine learning use cases have been stronger on the Spark side than on the Hadoop side."
Hadoop vs Spark: Customers
In a broad sense, a Hadoop specialist vendor like Hortonworks claims to work with 55 of the top 100 financial services companies and 75 of the top 100 retailers. Actual use cases are harder to come by, perhaps because the technology isn't as mature as the vendors would lead us to believe, perhaps because the customers still see the technology as a secret sauce.
DataStax customer British Gas Connected Homes is using Spark and Apache Cassandra to deliver real-time usage statistics to its customers from its smart-home devices.
Head of data and analytics at British Gas, Jim Anning, says: "We always knew we were doing the Internet of Things and we know that the number of connected devices is only going to rise. Those sensors are collecting data all the time. For example, our temperature sensor is delivering data every couple of minutes. Scaling that process with a traditional, relational database just wasn't going to cut it."
Innovative electric car maker Tesla uses Hadoop for its connected car data, travel booking company Expedia has been moving its data into a Hadoop environment as it continues to scale, and British Airways is a big exponent of Hadoop for data storage and analytics.
Gardella says Couchbase is seeing: "Companies that need to get the accounting done right or they go to prison are still running on traditional data warehouses. Banks and retailers are moving to Hadoop because it is just so much cheaper and more flexible and they have the guys with the data analyst skills you need."
Hadoop vs Spark: Vendors
Implementing Hadoop is possible in-house - Apache provides all the documentation required - or you can pick a vendor to conduct an enterprise deployment for you, complete with support. Spark is similar: do it yourself or go to a vendor, such as Hortonworks' Spark at Scale, Cloudera or MapR.
As of December 2015, Gartner has seven vendors offering commercial editions of Hadoop: Amazon, IBM, Pivotal, Transwarp, Hortonworks, Cloudera, MapR. Vendors like Couchbase, MongoDB, DataStax, Basho and MemSQL offer Spark built on competing data management platforms.
NoSQL database vendors have been launching Spark connectors over the last year or so, with MongoDB being one of the most recent entries. VP of strategy Kelly Stirman argues rival connectors are often little more than box-ticking exercises: "People view a connector as a marketing tactic to draw people into their funnels, so you see incomplete or not-feature-rich connectors to check the box."
In their market overview for Hadoop distributions, Gartner analysts Nick Heudecker, Merv Adrian and Ankush Jain say: "These changes in the Hadoop ecosystem come at a time when information management megavendors IBM (which offers its own distribution), Microsoft, Oracle, SAP and Teradata have largely completed their integration with the salient elements of Hadoop.
"They are incorporating it into their broader portfolio of capabilities, such as event stream processing, analytics, database management systems (DBMS), data federation and integration, metadata management, security and governance. The Hadoop distributors are also adding these capabilities, either through partnerships or managed development efforts."
Computerworld UK will be doing a deeper dive on Hadoop and Spark vendors, including a comparison of each and their individual merits/issues in the coming months.
Hadoop vs Spark: Conclusion
Despite its relative maturity, compared to Spark, Hadoop still isn't delivering the kind of transformative results many vendors will claim. According to Gartner's market guide: "Through 2018, 70 percent of Hadoop deployments will fail to meet cost savings and revenue generation objectives due to skills and integration challenges."
The answer? "Match projects to specific business requirements and identify the existence and readiness of supported technology components suitable for them," says Gartner.
Spark, on the other hand, has the potential to be truly transformative for the right kind of companies with the relevant expertise. As Gartner puts it: "Apache Spark emerged as a force as potentially disruptive to Hadoop as Hadoop was to traditional database management systems."
Although some IT departments may feel compelled to pick between Hadoop and Spark, the fact is it's not quite a straight fight. They can be complementary technologies working in tandem, or depending on your data, one may be better suited than the other.
But one of the major advantages either way is that your organisation will be able to resist vendor lock-in, so with the right team, getting a proof of concept off the ground is easier than ever.