Hadoop is one of the most popular courses at IIHT. We are listing the most frequently asked Hadoop interview questions, which should be of great help in your next interview. Interviewer-specific twists aside, this blog will give you a firm hold on the basics of Hadoop and its framework.
1. What is Hadoop? What are its main components?
Hadoop is an infrastructure that provides tools and services for storing and processing huge data sets. Hadoop is highly regarded as the ‘solution’ to most Big Data challenges organizations face, helping them make optimised business decisions.
Its main components are:
- HDFS – A Java-based, reliable file system with a Master-Slave architecture that stores vast datasets in blocks.
- Hadoop MapReduce – A programming framework that helps process large datasets. ‘Map’ converts the input data into tuples (key-value pairs), and ‘Reduce’ combines those tuples into a smaller, aggregated set of tuples.
- Hadoop Common – The shared Java libraries and utilities that support the other Hadoop modules.
- PIG and HIVE – The Data Access Components.
- HBase – For Data Storage
- Ambari, Oozie and ZooKeeper – Data Management and Monitoring Components
- Thrift and Avro – Data Serialization components
- Apache Flume, Sqoop, Chukwa – The Data Integration Components
- Apache Mahout and Drill – Data Intelligence Components
A few of the Hadoop ecosystem tools that boost Big Data performance significantly are Hive (SQL access), HBase (NoSQL storage), Oozie, and ZooKeeper.
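To make the Map and Reduce phases concrete, here is a minimal pure-Python sketch of the classic word-count pattern (illustrative only; real Hadoop jobs implement Mapper and Reducer classes in Java):

```python
from collections import defaultdict

def map_phase(lines):
    """'Map' turns each input line into (word, 1) tuples."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """'Reduce' combines the map tuples into a smaller set of
    aggregated tuples (here, one count per word)."""
    counts = defaultdict(int)
    for word, n in pairs:  # Hadoop's shuffle phase would group these by key
        counts[word] += n
    return dict(counts)

lines = ["big data big", "data tools"]
result = reduce_phase(map_phase(lines))
print(result)  # {'big': 2, 'data': 2, 'tools': 1}
```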
2. What are some of the practical applications of Hadoop?
Some of the real-world instances where Hadoop is playing a vital role are:
- City traffic management
- Fraud detection and prevention
- Improved customer service
- Personalisation of the online experience
- Improved healthcare services
3. What are the most common input formats in Hadoop?
There are three common input formats in Hadoop:
- Text – The default input format; each line of the file is a record.
- Sequence – Used for reading files in sequence (Hadoop’s binary key-value file format).
- Key Value – Used to read plain text files in which each line is split into a key and a value.
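The Key Value input format splits each line of a plain text file at a separator, tab by default. A small Python sketch of that behaviour (illustrative, not the Hadoop API):

```python
def parse_key_value(line, separator="\t"):
    """Split one text line into a (key, value) pair, the way a
    key-value input format would. If the separator is absent,
    the whole line becomes the key and the value is empty."""
    key, _, value = line.partition(separator)
    return (key, value)

record = parse_key_value("user42\tclicked_ad")
print(record)  # ('user42', 'clicked_ad')
```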
4. What is YARN?
YARN stands for “Yet Another Resource Negotiator”. It is Hadoop’s resource-management framework: it allocates cluster resources to applications and provides the environment in which data processing runs.
5. What is the benefit of “Checkpointing”?
Checkpointing is the procedure of producing a new FsImage by merging the existing FsImage with the edit log. It minimizes the startup time of the NameNode, since the NameNode can load one compact FsImage instead of replaying a long edit log, making the entire process more efficient.
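Conceptually, checkpointing replays the edit log’s operations on top of the last FsImage to build a new FsImage. A toy Python illustration of that merge (the FsImage here is just a dict, not Hadoop’s on-disk format):

```python
def checkpoint(fsimage, edit_log):
    """Apply the logged namespace operations to the old FsImage to
    build a new FsImage, so the NameNode can start from one compact
    snapshot instead of replaying a long edit log."""
    new_image = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            new_image[path] = "file"
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

old_image = {"/data/a.txt": "file"}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
print(checkpoint(old_image, edits))  # {'/data/b.txt': 'file'}
```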
6. What is “Rack Awareness”?
“Rack Awareness” is the algorithm the NameNode uses to decide how data blocks and their replicas are placed across the cluster. Placing replicas according to rack definitions reduces network traffic between DataNodes on different racks while still protecting the data against the failure of an entire rack.
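A simplified Python sketch of the default rack-aware placement policy (replication factor 3): the first replica goes on the writer’s rack, and the second and third share a different rack. This is an assumption-laden toy model, not HDFS’s actual placement code:

```python
def place_replicas(local_rack, racks):
    """Choose racks for 3 replicas, rack-awareness style: replica 1
    on the local rack; replicas 2 and 3 together on one remote rack,
    limiting inter-rack traffic while surviving a whole-rack failure."""
    remote = next(r for r in racks if r != local_rack)
    return [local_rack, remote, remote]

print(place_replicas("rack1", ["rack1", "rack2", "rack3"]))
# ['rack1', 'rack2', 'rack2']
```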
7. How to debug a Hadoop code?
To debug Hadoop code, first check the list of currently running MapReduce jobs, and then check whether any orphaned jobs are running simultaneously. If so, locate the ResourceManager logs by following these simple steps:
- Run “ps -ef | grep -i ResourceManager” and look for errors related to the specific job id.
- Identify the worker node that executed the task.
- Log in to that node and run “ps -ef | grep -i NodeManager”.
- Finally, scrutinize the NodeManager log.
It is observed that most errors are reported in the user-level logs for each MapReduce job.
8. What are Active and Passive NameNodes?
A highly available Hadoop cluster usually has two NameNodes: the Active NameNode and the Passive NameNode.
The Active NameNode serves the Hadoop cluster, while the Passive NameNode is a standby that keeps a synchronized copy of the Active NameNode’s metadata. Having two NameNodes is helpful when the Active NameNode crashes: the Passive NameNode takes the lead, so the cluster never has a single point of failure.
9. What are the modes in which Hadoop can run?
The modes in which Hadoop can run are:
- Standalone mode – Used for debugging purposes; it runs on the local file system and does not support HDFS.
- Pseudo-distributed mode – All daemons run on a single node; it requires configuring the mapred-site.xml, core-site.xml, and hdfs-site.xml files.
- Fully-distributed mode – Hadoop’s production stage, where data is distributed across the various nodes of a cluster, with the Master and Slave nodes allotted separately.
10. What are the schedulers in the Hadoop framework?
There are three different schedulers in the Hadoop framework:
- COSHH – Makes scheduling decisions by reviewing the cluster and the workload together, taking heterogeneity into consideration.
- FIFO Scheduler – Lines jobs up in a queue according to their time of arrival, without considering heterogeneity.
- Fair Sharing – Creates pools for individual users, each with a number of map and reduce slots on a resource, which they can use to execute their jobs.
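As an illustration of the simplest of the three, a FIFO scheduler just orders jobs by arrival time (a toy Python model, not Hadoop’s implementation):

```python
def fifo_schedule(jobs):
    """Order jobs purely by arrival time; node and job
    heterogeneity are ignored entirely."""
    return [name for name, arrival in sorted(jobs, key=lambda j: j[1])]

jobs = [("etl", 3), ("report", 1), ("backup", 2)]
print(fifo_schedule(jobs))  # ['report', 'backup', 'etl']
```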
11. What is Speculative Execution?
Speculative Execution is a backup mechanism of the Hadoop framework. Nodes running slower than the other nodes may hold back the entire job, so Hadoop detects, or ‘speculates’, lagging tasks and launches an equivalent backup task on another node. Both tasks then run simultaneously; whichever finishes first is accepted, and the other is killed.
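A toy Python model of the detection step: tasks whose reported progress falls far behind the average are flagged as stragglers and become candidates for speculative backup attempts (illustrative only; Hadoop’s actual heuristics are based on task progress reports):

```python
def find_stragglers(progress, threshold=0.5):
    """Flag tasks whose progress is far behind the average;
    these are the candidates for speculative backup attempts."""
    avg = sum(progress.values()) / len(progress)
    return [task for task, p in progress.items() if p < avg * threshold]

progress = {"task1": 0.9, "task2": 0.85, "task3": 0.2}
print(find_stragglers(progress))  # ['task3']
```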
12. Name the components of Apache HBase?
Apache HBase comprises three components:
- Region Server: After a table is divided into multiple regions, the Region Servers serve these regions to the clients.
- HMaster: The tool that manages and coordinates the Region Servers.
- ZooKeeper: The coordinator within the HBase distributed environment; it maintains server state inside the cluster through session-based communication.
13. Explain the use of RecordReader in Hadoop?
RecordReader stitches the data broken apart by Hadoop’s block boundaries back into a single readable record (a key-value pair).
For instance, if the input data is split into two blocks –
Row 1 – Learn latest
Row 2 – Technologies Swiftly
RecordReader reads this as “Learn latest Technologies Swiftly.”
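A minimal Python illustration of that stitching (the real record readers work on byte offsets within input splits, but the idea is the same):

```python
def read_record(blocks):
    """Join the pieces of one logical record that block
    boundaries split apart into a single readable record."""
    return " ".join(part.strip() for part in blocks)

blocks = ["Learn latest", "Technologies Swiftly"]
print(read_record(blocks))  # Learn latest Technologies Swiftly
```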