Hadoop vs. HDFS vs. HBase vs. Hive

Hadoop is a framework to process/query the Big data while Hive is an SQL Based tool that builds over Hadoop to process the data. 2. Hive process/query all the data using HQL (Hive Query Language) it’s SQL-Like Language while Hadoop can understand Map Reduce only.

What is difference between Hadoop and Hive?

Difference Between Hadoop and Hive

Hive: Hive is an application that runs over the Hadoop framework and provides SQL like interface for processing/query the data. Hive is designed and developed by Facebook before becoming part of the Apache-Hadoop project. Hive runs its query using HQL (Hive query language). Hive is having the same structure as RDBMS and almost the same commands can be used in Hive. Hive can store the data in external tables so it’s not mandatory to used HDFS also it support file formats such as ORC, Avro files, Sequence File and Text files, etc.

Hadoop vs. HDFS vs. HBase vs. Hive

It then organizes the data into HDFS tables and runs the jobs on a cluster to produce results. Hive is a simple way to apply structure to large amounts of unstructured data and then perform SQL based queries on them. Since it uses an interface that’s familiar with JDBC (Java Database Connectivity), it can easily integrate with traditional data center technologies.

The name node stores the metadata where all the data is being stored in the DataNodes. Also, if your NameNode goes down and you don’t have any backup, then your whole Hadoop instance will be unreachable. It’s a bit like losing the pointer when iterating over a linked list. If you don’t know where your data is stored next, you can’t get to it.

The DataNodes, on the other hand, are where the data is actually stored. If any specific DataNode is down, this should be OK because the NameNode will often manage multiple instances of the same blocks of data across data nodes (this is somewhat dependent on configuration).

The fact that you could run HDFS across cheap hardware and easily scale horizontally (which refers to buying more machines to handle data processing) has made it a highly popular option. Previously, most companies relied on vertical scaling (buying servers that are often expensive but can individually process more data). This was expensive and had more computational limitations.

HBase is part of the Hadoop ecosystem that provides read and write access in real-time for data in the Hadoop file system. Many big companies use HBase for their day-to-day functions for the same reason. Pinterest, for instance, works with 38 clusters of HBase to perform around 5 million operations every second!

Leave a Comment