Hive is a data warehouse infrastructure built on top of Hadoop to provide data summarisation, querying, and analysis.
As the importance of big data has grown within organisations, so has the number of
tools available to store, clean, and process it.
One of those tools is Apache Hive, a data warehouse infrastructure that is built on top of Hadoop to provide data summarisation, query, and analysis.
Originally developed by Facebook, Hive is now used and developed by the Apache community and by companies such as Amazon, which included it in Amazon Elastic MapReduce on Amazon Web Services.
But what jobs is it best suited to?
Hive's main feature is support for analysing large datasets stored in
Hadoop’s HDFS, as well as in compatible file systems such as Amazon S3.
It uses a SQL-like language called HiveQL with schema on read, and converts queries into MapReduce, Apache Tez, or Spark jobs.
The open source framework is best used for batch jobs over large sets of append-only data, and is not designed for OLTP workloads, nor does it offer real-time queries or row-level updates.
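To make "schema on read" and batch-style querying more concrete, here is a minimal sketch in plain Python rather than HiveQL. It does not use Hive or any Hive API; the sample data, the `SCHEMA` list, and both function names are hypothetical and exist only for illustration. The idea it tries to show is that raw lines are stored without validation, a schema is applied only when the data is read, and a value that does not fit its declared type is treated as missing, much as Hive returns NULL in that case.

```python
import io
from collections import defaultdict

# Hypothetical raw data: tab-separated lines appended to a file, as they
# might sit in HDFS or S3. Nothing validates them at write time.
RAW_LOG = "alice\t3\nbob\t5\nalice\tn/a\nbob\t2\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("user", str), ("clicks", int)]


def read_with_schema(stream, schema):
    """Yield one dict per line, casting each field according to the schema."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # A value that does not fit the declared type becomes None,
                # comparable to a NULL in a query result.
                row[name] = None
        yield row


def total_clicks_per_user(rows):
    """Full-scan aggregation, roughly a SUM(clicks) grouped by user."""
    totals = defaultdict(int)
    for row in rows:
        if row["clicks"] is not None:
            totals[row["user"]] += row["clicks"]
    return dict(totals)


totals = total_clicks_per_user(read_with_schema(io.StringIO(RAW_LOG), SCHEMA))
print(totals)
```

The aggregation scans every row once and never modifies the stored lines, which mirrors the batch, append-only workload described above. A real Hive deployment would compile the equivalent query into distributed jobs instead of a single loop.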