Apache Hadoop is an open-source software framework that
Stores big data in a distributed manner
Processes big data in parallel
Runs on large clusters of commodity hardware
Is based on Google’s papers on the Google File System (2003) and MapReduce (2004)
Hadoop is:
Easily scalable to petabytes or more (Volume)
Able to process data in parallel (Velocity)
Able to store all kinds of data (Variety)
Hadoop offers:
Redundant, Fault-tolerant data storage (HDFS)
Parallel computation framework (MapReduce)
Job coordination/scheduling (YARN)
Programmers no longer need to worry about:
Where the file is located
How to handle failures and data loss
How to divide the computation
How to program for scaling (a word-count sketch follows this list)
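To make the MapReduce idea concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, written in Python. The mapper/reducer split is the standard model; the single-script layout and the command-line flag are illustrative assumptions, not a fixed Hadoop API.

```python
#!/usr/bin/env python3
# Word count for Hadoop Streaming, sketched as two small functions.
# In practice each function would live in its own executable script
# (e.g., mapper.py and reducer.py) passed to the streaming jar.
import sys

def mapper(lines):
    # emit "<word>\t1" for every word in the input
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # input arrives sorted by key, so counts for one word are adjacent
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(f"{current_word}\t{current_count}")
            current_count = 0
        current_word = word
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # choose the role with a command-line flag, e.g. `... map` or `... reduce`
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```

An illustrative launch would pass the mapper and reducer scripts to the Hadoop Streaming jar with -mapper and -reducer, plus HDFS -input and -output paths; the framework then takes care of splitting the input, shuffling by key, and rerunning failed tasks.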
Hadoop Ecosystem
Core of Hadoop:
Hadoop Distributed File System (HDFS)
MapReduce
YARN (Yet Another Resource Negotiator) (from Hadoop v2.0)
Additional software packages:
Pig
Hive
Spark
HBase
and many others
The Master-Slave Architecture of Hadoop
Hadoop Distributed File System (HDFS)
HDFS is a file system that
follows a master-slave architecture
allows us to store data across multiple nodes (machines)
allows multiple users to access the data
just like the file system on your PC (a small usage sketch follows this list)
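As a minimal sketch of what using HDFS "like a regular file system" looks like, the snippet below copies a local file into HDFS, lists the directory, and reads the file back. It assumes a running cluster with the hdfs command on the PATH; the paths and file names are made up for illustration.

```python
import subprocess

# Copy a local file into HDFS (paths are illustrative)
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "sales.csv", "/user/demo/sales.csv"], check=True)

# List the directory and print the file back, much like ls/cat on a local file system
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-cat", "/user/demo/sales.csv"], check=True)
```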
HDFS supports
distributed storage
distributed computation
horizontal scalability
Vertical Scaling vs. Horizontal Scaling
HDFS Architecture
NameNode
NameNode maintains and manages the blocks in the DataNodes (slave nodes).
Master node
Functions:
records the metadata of all the files
FsImage: an image of the file system namespace since the NameNode was started
EditLogs: all the recent modifications to the metadata (e.g., for the past hour); each change to the metadata is recorded here
regularly checks the status of DataNodes
keeps a record of all the blocks in HDFS and which DataNodes store them (a conceptual sketch follows this list)
handles data recovery if a DataNode fails
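A conceptual sketch (not Hadoop's real data structures) of the metadata the NameNode keeps: for every file, the ordered list of block IDs, and for every block, the DataNodes holding its replicas. All names and IDs below are made up.

```python
# Toy model of NameNode metadata (illustration only)
file_to_blocks = {
    "/user/demo/sales.csv": ["blk_001", "blk_002"],          # file -> ordered block IDs
}
block_to_datanodes = {
    "blk_001": ["datanode-1", "datanode-4", "datanode-7"],   # 3 replicas per block
    "blk_002": ["datanode-2", "datanode-5", "datanode-7"],
}

def locate(path):
    """For each block of the file, return the DataNodes that store a replica."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

print(locate("/user/demo/sales.csv"))
```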
DataNode
Commodity hardware that stores the data
Slave node
Functions
stores the actual data
serves read and write requests from clients
reports its health to the NameNode via heartbeats
NameNode vs. DataNode
If the NameNode Failed…
all the files on HDFS would be lost
there is no way to reconstruct the files from the blocks in the DataNodes without the metadata in the NameNode
In order to make the NameNode resilient to failure:
back up the metadata in the NameNode (e.g., with a remote NFS mount)
Secondary NameNode
Takes checkpoints of the file system metadata present on the NameNode
It is not a backup NameNode!
Functions:
Stores a copy of the FsImage file and the EditLogs
Periodically applies the EditLogs to the FsImage and refreshes (empties) the EditLogs (a checkpoint sketch follows this list)
If the NameNode fails, the file system metadata can be recovered from the last saved FsImage on the Secondary NameNode
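A toy sketch of the checkpoint idea: merge the EditLogs into the FsImage so that a restart can load a recent image instead of replaying a long log. The dictionaries and record format below are simplifications, not Hadoop's on-disk formats.

```python
# Toy checkpoint: apply edit-log records to an in-memory namespace image
fsimage = {"/user/demo": "dir"}                      # last saved namespace snapshot
editlog = [("create", "/user/demo/a.txt"),
           ("create", "/user/demo/b.txt"),
           ("delete", "/user/demo/a.txt")]

def checkpoint(image, edits):
    """Merge the edits into the image and return the new image plus an empty log."""
    image = dict(image)
    for op, path in edits:
        if op == "create":
            image[path] = "file"
        elif op == "delete":
            image.pop(path, None)
    return image, []                                 # fresh, empty edit log

fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)   # {'/user/demo': 'dir', '/user/demo/b.txt': 'file'}
```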
NameNode vs. Secondary NameNode
Blocks
A block is a sequence of bytes that stores data
Data is stored as a set of blocks in HDFS
The default block size is 128 megabytes (Hadoop 2.x and 3.x); in Hadoop 1.x the default was 64 megabytes
A file is split into multiple blocks
Why Large Block Size?
HDFS stores huge datasets
If the block size is small (e.g., the 4 KB blocks typical of Linux file systems), then the number of blocks is large:
too much metadata for the NameNode
too many seeks slow down reads
read time = seek time + transfer time
transfer time = total file size / transfer rate
(a worked example follows below)
small blocks also harm the performance of MapReduce (each block typically becomes one map task)
For similar reasons, we don’t recommend using HDFS for many small files.
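A back-of-the-envelope check of the read-time formula above, assuming a 10 ms seek per block and a 100 MB/s transfer rate (both numbers are illustrative):

```python
import math

# read time = (number of blocks) * seek time + total file size / transfer rate
def read_time_seconds(file_size_mb, block_size_mb, seek_ms=10, rate_mb_per_s=100):
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return num_blocks * seek_ms / 1000 + file_size_mb / rate_mb_per_s

# Reading a 1 GB (1024 MB) file:
print(read_time_seconds(1024, 0.004))  # ~4 KB blocks: roughly 2570 s, dominated by seeks
print(read_time_seconds(1024, 128))    # 128 MB blocks: roughly 10.3 s
```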
If DataNode Failed…
Commodity hardware fails
If the NameNode has not heard from a DataNode for 10 minutes, the DataNode is considered dead…
HDFS guarantees data reliability by keeping multiple replicas of the data
each block has 3 replicas by default
replicas are stored on different DataNodes
if blocks are lost due to the failure of a DataNode, they can be recovered from the other replicas (a simplified recovery sketch follows this list)
the total consumed space is 3 times the data size
Replication also helps maintain data integrity (whether the stored data is correct or not)
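A toy sketch of the recovery logic described above: if a DataNode has been silent for 10 minutes, treat it as dead, find blocks whose replica count has dropped, and copy them from a surviving replica to another node. This illustrates the idea only; it is not the NameNode's actual implementation.

```python
import time

DEAD_AFTER_SECONDS = 10 * 60     # no heartbeat for 10 minutes -> node considered dead
REPLICATION_FACTOR = 3

last_heartbeat = {"datanode-1": time.time() - 700,   # silent for >10 min
                  "datanode-2": time.time(),
                  "datanode-3": time.time(),
                  "datanode-4": time.time()}
block_to_datanodes = {"blk_001": ["datanode-1", "datanode-2", "datanode-3"]}

now = time.time()
dead = {dn for dn, t in last_heartbeat.items() if now - t > DEAD_AFTER_SECONDS}

for blk, nodes in block_to_datanodes.items():
    alive = [dn for dn in nodes if dn not in dead]
    if alive and len(alive) < REPLICATION_FACTOR:
        # re-replicate from a surviving copy onto nodes that lack the block
        candidates = [dn for dn in last_heartbeat if dn not in dead and dn not in alive]
        block_to_datanodes[blk] = alive + candidates[:REPLICATION_FACTOR - len(alive)]

print(block_to_datanodes)   # blk_001 now lives on datanode-2, -3 and -4
```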
File, Block and Replica
A file consists of one or more blocks
The blocks are different from one another
How many blocks a file has depends on the file size and the block size
A block has multiple replicas
The replicas are identical copies of the block
How many replicas a block has depends on the preset replication factor (a worked example follows this list)
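A quick worked example of the distinction, using the defaults quoted above (128 MB blocks, replication factor 3); the 500 MB file size is arbitrary:

```python
import math

file_size_mb = 500      # arbitrary example file
block_size_mb = 128     # default block size in Hadoop 2.x/3.x
replication = 3         # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)   # 4 distinct blocks (the last is smaller)
replicas = blocks * replication                    # 12 stored block copies in total
consumed_mb = file_size_mb * replication           # ~1500 MB of raw cluster space

print(blocks, replicas, consumed_mb)               # 4 12 1500
```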
Replication Management
Each block is replicated 3 times and stored on different DataNodes
Why default replication factor 3?
With only 1 replica:
if a DataNode fails, its blocks are lost
Assume
# of nodes N = 4000
# of blocks R = 1,000,000
Node failure rate FPD = 1 per day (you expect to see 1 machine fail per day)
If one node fails, then R/N = 250 blocks are lost
E(# of blocks lost in one day) = 250
If the number of blocks lost in one day follows a Poisson distribution with mean 250, then
Pr[# of blocks lost in one day >= 250] ≈ 0.508 (a quick numerical check follows below)
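A quick numerical check of that figure with SciPy: for X ~ Poisson(250), Pr[X >= 250] comes out to about 0.508, i.e., losing 250 or more blocks on such a day is roughly a coin flip.

```python
from scipy.stats import poisson

mean_lost_per_day = 1_000_000 / 4000      # R / N = 250 blocks expected lost
# P(X >= 250) = 1 - P(X <= 249) for X ~ Poisson(250)
p = poisson.sf(249, mean_lost_per_day)
print(round(p, 3))                        # ≈ 0.508
```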