What is Hadoop Distributed File System (HDFS)?


What is HDFS?

HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system designed to run on commodity hardware. It is fault-tolerant and built to be deployed on low-cost hardware. HDFS provides high-throughput access to application data, is suitable for applications with large data sets, and enables streaming access to file system data in Apache Hadoop. Advantages of HDFS include:

  • Fault tolerance. HDFS is designed to detect faults and recover from them automatically and quickly, ensuring continuity and reliability.
  • Speed. Because of its cluster architecture, HDFS can sustain a throughput of 2 GB of data per second.
  • Access to more types of data, specifically streaming data. Because HDFS is designed to handle large amounts of data for batch processing, it allows for high data-throughput rates, making it well suited to streaming data (see the write sketch after this list).
  • Compatibility and portability. HDFS is designed to be portable across a variety of hardware setups and compatible with several underlying operating systems, ultimately letting users run HDFS on their own tailored setup. These advantages are especially significant when dealing with big data, and they are made possible by the particular way HDFS handles data.
  • Scalability. You can scale resources according to the size of your file system; HDFS includes both vertical and horizontal scaling mechanisms.
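
As a concrete illustration of streaming access, here is a minimal sketch that streams a local file into HDFS through the Hadoop FileSystem Java API. The NameNode address (hdfs://namenode:8020) and both file paths are hypothetical placeholders; in a real deployment the address would normally come from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.FileInputStream;
import java.io.InputStream;

public class HdfsStreamingWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = new FileInputStream("/tmp/events.log");            // example local source
             FSDataOutputStream out = fs.create(new Path("/data/events.log"))) { // example HDFS destination
            byte[] buffer = new byte[64 * 1024];
            int bytesRead;
            // The application only sees a sequential output stream; HDFS splits
            // the written data into blocks and replicates them across DataNodes.
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
        }
    }
}

The application never manages blocks or replicas directly, which is what lets HDFS keep write throughput high while the cluster handles distribution behind the scenes.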
How does HDFS store data?

The HDFS file system consists of master services (the NameNode and Secondary NameNode) and worker services (the DataNodes). The NameNode and Secondary NameNode manage the HDFS metadata, while the DataNodes host the underlying HDFS data. HDFS divides files into blocks and stores each block on a DataNode; multiple DataNodes within the cluster report to the NameNode. The NameNode tracks which DataNodes hold the blocks of a given file, distributes replicas of those blocks across the cluster, and tells the user or application where to find the requested data.
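
To make the block-to-DataNode mapping concrete, the small sketch below asks the NameNode for the block locations of a file, again with a placeholder cluster address and file path. Only metadata is returned; the block contents themselves stay on the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/events.log");     // example file
            FileStatus status = fs.getFileStatus(file);

            // Ask the NameNode which DataNodes host each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}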

  • By default, HDFS is configured with 3x replication, which means each data set has two additional copies. While this improves the likelihood of data being local during processing, it introduces an overhead in storage cost (see the replication sketch after this list).
  • HDFS works best when configured with locally attached storage. This ensures the best performance for the file system.
  • Increasing the capacity of HDFS requires the addition of new servers (compute, memory, disk), not just storage media.
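
The replication factor is a per-file property rather than a fixed constant, and it can be inspected and changed after a file is written. Below is a small sketch using the same placeholder cluster address and file path as above; lowering replication trades durability and data locality for storage space.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/events.log");     // example file

            // Default replication for new files (3 unless dfs.replication is overridden).
            System.out.println("default replication: " + fs.getDefaultReplication(file));

            // Current replication factor of this particular file.
            System.out.println("file replication: " + fs.getFileStatus(file).getReplication());

            // Drop to two copies; the NameNode schedules removal of the extra replica.
            fs.setReplication(file, (short) 2);
        }
    }
}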