Hadoop and the world of big data have transitioned from being buzzwords and hype to being a reality. Businesses have realised the value of big data and are beginning to understand the use cases for the technologies in the Hadoop ecosystem.
While the penny may have dropped for some businesses, Hadoop has not yet forged a name for itself as the simplest set of technologies to use, and the mystery about what exactly it is only adds to the complexity.
Fret no further, for CBR is here to help with a guide to getting started with Hadoop, how to use it, and some use cases to help businesses identify where they can apply it.
Firstly, Hadoop is a free, Java-based programming framework that is designed to support the processing of large data sets in a distributed computing environment. If two heads are better than one then multiple heads are even better.
Hadoop’s ability to run in this environment allows applications to operate on systems with thousands of nodes, handling thousands of terabytes of data with fast data transfer rates between nodes.
Hadoop has a broad ecosystem, with core technologies such as MapReduce and the Hadoop Distributed File System (HDFS), and related projects such as Apache Hive, HBase and Spark.
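To make the MapReduce idea more concrete, here is a minimal sketch of its three stages (map, shuffle, reduce) in plain Python. It does not use Hadoop or any Hadoop API, and the function names are illustrative assumptions rather than anything Hadoop defines. On a real cluster the map and reduce steps would be spread across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could, in principle, be mapped on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big value", "data lake"]))
```

The point of the split is that neither the map step nor the reduce step needs to see the whole data set at once, which is what lets the work be shared across many machines.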
Now that you know broadly what Hadoop is and some of the technologies, we can move on to how to use it.
The technology has become easier to consume since it first appeared, thanks to the work of the Apache Hadoop community as a whole and of vendors such as Hortonworks, Cloudera and MapR.
As with most technologies, a set of best practices should be followed before a wide-scale deployment.
It has been noted in an Intel whitepaper on best practices for implementing Apache Hadoop software that the hardware, networking and software selection for a Hadoop implementation can significantly influence performance, total cost of ownership and return on investment.
Therefore it is necessary to combine a cost-effective infrastructure with a Hadoop distribution that is optimised for performance.
The whitepaper says: "Maximizing performance requires implementing the most effective data transfer and integration methods for the existing enterprise BI platforms and the selected Hadoop platform, as well as processes to ensure optimal use in a multitenancy environment."
For the business, it is necessary to first select a well-defined use case. Hadoop technologies aren’t suited to all business needs, so decide what the use case will be.
Hadoop is particularly well suited to processing large amounts of unstructured or semi-structured data. This isn’t to say that the technologies in Hadoop are unusable for smaller amounts of data. With technologies such as Spark, the Hadoop world has also gained the capability to process real-time data for fast insights.
Finding the right use case is a significant step on the path to Hadoop. One candidate might be trying out batch aggregation on data held in a technology such as MongoDB.
While the MongoDB technology already has built-in aggregation functionality, for more complex data aggregation Hadoop can be the right route to follow.
In this scenario the data can be pulled from MongoDB and processed within Hadoop using one or more MapReduce jobs. Not only can the user bring in data from the Mongo database, but data can also be brought in from other sources in order to create a multi-datasource solution.
Once the MapReduce jobs have run, their output can be written back to MongoDB for further querying and ad-hoc analysis.
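The flow described above can be sketched in a few lines of plain Python. This is an illustration only: the in-memory lists stand in for a MongoDB collection and a second data source, the customer names and amounts are made up, and the function names are hypothetical. A real deployment would use a MongoDB connector and Hadoop jobs rather than local functions.

```python
from collections import defaultdict

# Stand-in for documents read from a MongoDB collection (hypothetical data).
mongo_orders = [
    {"customer": "alice", "amount": 30.0},
    {"customer": "bob", "amount": 12.5},
    {"customer": "alice", "amount": 7.5},
]

# Stand-in for a second source, such as rows parsed from a log file.
web_orders = [("bob", 20.0), ("carol", 5.0)]


def map_mongo(doc):
    """Turn one MongoDB-style document into a (key, value) pair."""
    return doc["customer"], doc["amount"]


def map_web(row):
    """Turn one row from the second source into the same (key, value) shape."""
    customer, amount = row
    return customer, amount


def reduce_totals(pairs):
    """Sum the amounts for each customer across both sources."""
    totals = defaultdict(float)
    for customer, amount in pairs:
        totals[customer] += amount
    return dict(totals)


def run_job(mongo_docs, other_rows):
    pairs = [map_mongo(d) for d in mongo_docs] + [map_web(r) for r in other_rows]
    totals = reduce_totals(pairs)
    # Shape the results as documents that could be written back for querying.
    return [{"_id": c, "total": t} for c, t in sorted(totals.items())]


if __name__ == "__main__":
    for doc in run_job(mongo_orders, web_orders):
        print(doc)
```

Because both sources are mapped into the same key and value shape, the reduce step does not need to know where each record came from, which is what makes the multi-datasource approach workable.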
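Developers using Hadoop require a Java programming environment, so downloading a Java Development Kit (JDK) is a good place to start. Older Hadoop releases require Java Standard Edition version 6, while newer releases need a later version, so check the requirements of the release you plan to use.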
Yahoo helps developers to get started with the technology by providing a virtual machine image that contains a preconfigured Hadoop installation. The VM image runs inside a "sandbox" environment in which another operating system can run.
Hadoop is a complex technology, but an important thing to remember is that the business doesn’t necessarily need to know how complex it is or how it works. The work done by the Apache Hadoop developer community, along with the large vendors, has helped to mask the complexities so that businesses can instead focus on putting it to use.
Some of the most common use cases include using it as a data refinery, where organisations use the technology to incorporate new data sources into their commonly used BI or analytic applications.
Data exploration is another common use case, in which organisations capture and store large quantities of data, perhaps in a data lake, and then analyse that data. The data is left in Hadoop so that it can be explored there directly, rather than Hadoop being used as a staging area for processing before the data is put into a data warehouse.
A third use case is application enrichment, which takes data that is stored in Hadoop and uses it to direct an application’s behaviour. This can include storing web session data in order to personalise the customer’s experience by looking at their browsing habits.
Not only can this be used to generally improve the browsing experience but it can also help to provide offers to website users that might help to ensure a purchase.
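As a rough illustration of application enrichment, the sketch below picks an offer based on the category a visitor has viewed most often in a session. It is plain Python with made-up data. The event fields, the offers table and the function names are all assumptions for the example, and in practice the session data would be read from Hadoop rather than passed in as a list.

```python
from collections import Counter


def top_category(session_events):
    """Return the most viewed category in a session, or None if it is empty."""
    counts = Counter(event["category"] for event in session_events)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def pick_offer(session_events, offers, default="homepage-banner"):
    """Choose an offer for the visitor's favourite category, else a default."""
    return offers.get(top_category(session_events), default)


if __name__ == "__main__":
    # Hypothetical session data and offers table.
    events = [
        {"category": "laptops"},
        {"category": "phones"},
        {"category": "laptops"},
    ]
    offers = {"laptops": "10% off laptop bags"}
    print(pick_offer(events, offers))
```

A visitor with no recorded history, or whose favourite category has no matching offer, simply receives the default, so the application still behaves sensibly when the stored data has nothing to add.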
Although the technology is complex, users do not need to understand that complexity in order to use it. The important elements are knowing your use cases and having best practices in place to make sure it benefits the business.