Hive is the data warehousing system for Hadoop. It provides a way to issue SQL-like statements against data in Hadoop and a mechanism to project structure onto that data. In a core Hadoop setup, the user has to write MapReduce programs in Java or use scripting languages with Hadoop Streaming. So, for users who are familiar with SQL, Hive provides an abstraction on top of MapReduce. The Hive query language, known as HQL, is very similar to SQL, though it does not support correlated subqueries and some other SQL constructs. Hive decomposes an HQL statement into one or more MapReduce jobs and executes them on the Hadoop cluster. The Hadoop cluster itself is agnostic to Hive: to Hadoop, Hive is just another MapReduce program reading data from HDFS. Hive creates a directory named /user/hive/warehouse to store tables, so a table is just another folder with some data files in HDFS, and partitions are stored as subfolders under the table's folder. Since HDFS is a write-once file system, there is no support for modifying or deleting individual records through Hive.
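As a quick sketch (the table and file names here are made up for illustration), a Hive table maps directly to a warehouse directory, and a query against it compiles down to MapReduce:

```sql
-- Hypothetical example: create a table; under the default configuration
-- this creates the directory /user/hive/warehouse/page_views/ in HDFS.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
);

-- Load a file already in HDFS into the table; the file is moved
-- into the table's warehouse directory.
LOAD DATA INPATH '/data/page_views.txt' INTO TABLE page_views;

-- A statement like this is decomposed into a MapReduce job
-- and run on the Hadoop cluster.
SELECT url, COUNT(*)
FROM page_views
GROUP BY url;
```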
Hive surfaces relational concepts such as tables and columns over the data in Hadoop. This data model is stored in the Hive metastore, a repository for Hive metadata kept in a relational database; by default, Hive uses an embedded Derby database. The data in a Hive table can be partitioned, commonly by date. In addition, Hive allows data to be further divided into buckets. Bucketing helps with sampling and with predictive algorithms that work on a subset of the data. Hive does not check that the data matches the structure of the table when loading it; the structure is projected onto the data at query time.
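A sketch of partitioning and bucketing (table name, partition value, and bucket count are illustrative):

```sql
-- Hypothetical example: partition by date, bucket by user.
CREATE TABLE page_views_part (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Each partition becomes its own subfolder under the table folder, e.g.
-- /user/hive/warehouse/page_views_part/dt=2012-01-01/

-- Bucketing makes sampling cheap: read only one bucket out of 32.
SELECT *
FROM page_views_part
TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```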
Hive supports the following data types:
TINYINT - 1-byte integer
SMALLINT - 2-byte integer
INT - 4-byte integer
BIGINT - 8-byte integer
BOOLEAN - TRUE/FALSE
FLOAT - single-precision floating point
DOUBLE - double-precision floating point
STRING - sequence of characters
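The types above can be combined in a table definition; this is a made-up schema just to show the syntax:

```sql
-- Hypothetical table exercising the listed data types.
CREATE TABLE sensor_readings (
  id        BIGINT,    -- 8-byte integer
  sensor    STRING,    -- character data
  reading   DOUBLE,    -- double precision
  quality   FLOAT,     -- single precision
  retries   TINYINT,   -- 1-byte integer
  port      SMALLINT,  -- 2-byte integer
  samples   INT,       -- 4-byte integer
  is_valid  BOOLEAN    -- TRUE/FALSE
);
```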