What is the difference between Hadoop and Hive?
Difference Between Hadoop and Hive
Hive: Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for querying and processing data. Hive was designed and developed at Facebook before becoming part of the Apache Hadoop project. Hive queries are written in HQL (Hive Query Language). Hive organizes data into tables much as an RDBMS does, and many of the same SQL commands can be used. Hive can also define external tables over data that lives outside its own warehouse, so storing the data in HDFS is not mandatory, and it supports file formats such as ORC, Avro, SequenceFile, and plain text.
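To make the external-table idea concrete, here is a minimal Python sketch that only builds the text of a HiveQL statement; it does not connect to Hive. The helper name external_table_ddl, the table page_views, its columns, and the path /data/page_views are all hypothetical, chosen for illustration.

```python
def external_table_ddl(table, columns, location, file_format="ORC"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement as a string.

    Illustrative helper only: it assembles the statement text and does
    not talk to a Hive server.
    """
    # File formats mentioned in the text above, as HiveQL keywords.
    supported = {"ORC", "AVRO", "SEQUENCEFILE", "TEXTFILE"}
    fmt = file_format.upper()
    if fmt not in supported:
        raise ValueError(f"unsupported file format: {file_format}")
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS {fmt}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table and path, used only for this example.
ddl = external_table_ddl(
    "page_views",
    [("user_id", "BIGINT"), ("url", "STRING"), ("view_time", "TIMESTAMP")],
    "/data/page_views",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the table definition; the files at the given location are left in place.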
Hadoop vs. HDFS vs. HBase vs. Hive
Hive organizes the data into tables stored in HDFS and runs jobs on a cluster to produce results. It is a simple way to apply structure to large amounts of unstructured data and then run SQL-based queries against it. Because it exposes a JDBC (Java Database Connectivity) interface, it can easily integrate with traditional data center technologies.
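The idea of applying structure to raw data at query time can be sketched in plain Python. This is only an analogy for what Hive does, not Hive code: the sample lines, the SCHEMA list, and the read_rows helper are assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical raw log lines, as they might sit in a plain text file.
RAW_LINES = [
    "42\t/home\t2023-01-05 10:00:00",
    "7\t/pricing\t2023-01-05 10:00:03",
    "42\t/pricing\t2023-01-05 10:01:10",
]

# The "schema": a column name plus a parser for each tab-separated field.
SCHEMA = [
    ("user_id", int),
    ("url", str),
    ("view_time", lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")),
]


def read_rows(lines, schema):
    """Apply the schema while reading; the raw lines are never rewritten."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield {name: parse(value) for (name, parse), value in zip(schema, fields)}


# Roughly: SELECT url, COUNT(*) FROM page_views WHERE user_id = 42 GROUP BY url
counts = {}
for row in read_rows(RAW_LINES, SCHEMA):
    if row["user_id"] == 42:
        counts[row["url"]] = counts.get(row["url"], 0) + 1

print(counts)  # {'/home': 1, '/pricing': 1}
```

The raw text stays untouched; only the reader knows about column names and types, which is why the same files could be read again later under a different schema.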
The NameNode stores the metadata that records where all the data is stored on the DataNodes. If your NameNode goes down and you don’t have a backup, your whole Hadoop instance becomes unreachable. It’s a bit like losing the pointer when iterating over a linked list: if you don’t know where the next piece of data is stored, you can’t get to it.
DataNodes, on the other hand, are where the data is actually stored. If a specific DataNode goes down, this should be OK, because the NameNode usually manages multiple replicas of the same blocks of data across DataNodes (how many depends on configuration).
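The division of labor between the NameNode and the DataNodes can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not how HDFS is implemented: the ToyCluster class, its round-robin placement, and the node names dn1 to dn4 are invented for the example, and the model does not re-replicate blocks after a failure. The replication factor of 3 mirrors the usual HDFS default.

```python
class ToyCluster:
    """Toy model: one block map (NameNode) plus several block stores (DataNodes)."""

    def __init__(self, datanode_names, replication=3):
        self.replication = replication
        # DataNodes: name -> {block_id: data}
        self.datanodes = {name: {} for name in datanode_names}
        # NameNode metadata: block_id -> list of DataNode names holding a replica
        self.block_map = {}
        self._next = 0

    def write_block(self, block_id, data):
        names = sorted(self.datanodes)
        if self.replication > len(names):
            raise ValueError("replication factor exceeds number of live DataNodes")
        # Simplified round-robin placement (real HDFS placement is rack-aware).
        targets = [names[(self._next + i) % len(names)] for i in range(self.replication)]
        self._next = (self._next + 1) % len(names)
        for name in targets:
            self.datanodes[name][block_id] = data
        self.block_map[block_id] = targets

    def fail_datanode(self, name):
        # The node and every replica it held disappear.
        self.datanodes.pop(name)

    def lose_namenode(self):
        # The data is still on the DataNodes, but nobody knows where.
        self.block_map = None

    def read_block(self, block_id):
        if self.block_map is None:
            raise RuntimeError("NameNode metadata lost: block locations unknown")
        for name in self.block_map[block_id]:
            store = self.datanodes.get(name)
            if store is not None and block_id in store:
                return store[block_id]
        raise RuntimeError(f"no live replica of {block_id}")


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"], replication=3)
cluster.write_block("blk_1", b"hello")
cluster.write_block("blk_2", b"world")

# Losing one DataNode is survivable: another replica answers the read.
cluster.fail_datanode("dn1")
after_datanode_failure = cluster.read_block("blk_1")

# Losing the NameNode metadata without a backup is not.
cluster.lose_namenode()
try:
    cluster.read_block("blk_1")
    namenode_error = None
except RuntimeError as exc:
    namenode_error = str(exc)

print(after_datanode_failure, namenode_error)
```

In the sketch the bytes are still sitting in the surviving DataNode stores after lose_namenode() is called; what is gone is the map that says which store holds which block, which is the linked-list analogy from the text.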
The fact that you can run HDFS on cheap hardware and easily scale horizontally (that is, buy more machines to handle data processing) has made it a highly popular option. Previously, most companies relied on vertical scaling (buying servers that are often expensive but can individually process more data). That approach was costly and had tighter computational limits.
HBase is the part of the Hadoop ecosystem that provides real-time read and write access to data in the Hadoop file system. Many big companies use HBase for their day-to-day functions for this reason. Pinterest, for instance, works with 38 HBase clusters to perform around 5 million operations every second!