
Hadoop Tutorial by Ttacy341(f): 11:56am On Sep 13, 2017
Big Data

Big Data is a term that represents data sets whose size is beyond the capacity of commonly used software tools to manage and process within a tolerable elapsed time. Big data sizes are a constantly moving target; as of 2012 they ranged from a few dozen terabytes to many petabytes in a single data set. It is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

Big Data is commonly defined by three characteristics

Volume
Velocity
Variety
We already have RDBMSs to store and process structured data. But of late we have been getting data in the form of videos, images and free text. This is called unstructured and semi-structured data, and it is difficult to store and process efficiently using an RDBMS. So we definitely have to find an alternative way to store and process this type of unstructured and semi-structured data.

HADOOP is one of the technologies for efficiently storing and processing large sets of data. HADOOP is entirely different from a traditional distributed file system and overcomes the problems that exist in traditional distributed systems. HADOOP is an open-source framework written in Java for storing data in a distributed file system and processing that data in parallel across a cluster of commodity nodes.
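As a rough illustration (not from the original post) of what "storing data in a distributed file system" looks like from a developer's point of view, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address and file path below are hypothetical placeholders; in a real cluster they come from the configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode URI; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS transparently splits large files into blocks
        // and replicates them across the cluster.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}

The point is that client code only sees paths and streams; block placement and replication are handled by HDFS itself.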

The Motivation for Hadoop


What problems exist with ‘traditional’ large-scale computing systems?

What requirements should an alternative approach have?

How does Hadoop address those requirements?

Problems with Traditional Large-Scale Systems

Traditionally, computation has been processor-bound, operating on relatively small amounts of data
For decades, the primary push was to increase the computing power of a single machine
Distributed systems evolved to allow developers to use multiple machines for a single job
Distributed Systems: Data Storage

Typically, data for a distributed system is stored on a SAN
At compute time, data is copied to the compute nodes
Fine for relatively limited amounts of data
Distributed Systems: Problems

Programming for traditional distributed systems is complex
Data exchange requires synchronization
Finite bandwidth is available
Temporal dependencies are complicated
It is difficult to deal with partial failures of the system
The Data-Driven World

Modern systems have to deal with far more data than was the case in the past
Organizations are generating huge amounts of data
That data has inherent value, and cannot be discarded
Examples: Facebook - over 70PB of data, eBay - over 5PB of data, etc.

Many organizations are generating data at a rate of terabytes per day. Getting the data to the processors becomes the bottleneck.

Requirements for a New Approach

Partial Failure Support

The system must support partial failure
Failure of a component should result in a graceful degradation of application performance, not complete failure of the entire system.
Data Recoverability

If a component of the system fails, its workload should be assumed by still-functioning units in the system
Failure should not result in the loss of any data
Component Recovery

If a component of the system fails and then recovers, it should be able to rejoin the system without requiring a full restart of the entire system.
Consistency

Component failures during execution of a job should not affect the outcome of the job

Scalability

Adding load to the system should result in a graceful decline in the performance of individual jobs, not failure of the system.

Increasing resources should support a proportional increase in load capacity.

Hadoop’s History
Hadoop is based on work done by Google in the late 1990s and early 2000s.
Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004.
This work takes a radically new approach to the problems of distributed computing, so that it meets the requirements of reliability and availability.
The core concept is distributing the data as it is initially stored in the system.
Individual nodes can work on data local to those nodes, so data does not need to be transmitted over the network.
Developers need not worry about network programming, temporal dependencies or low-level infrastructure.
Nodes talk to each other as little as possible; developers should not write code that communicates between nodes.
Data is spread among the machines in advance, so that computation happens where the data is stored wherever possible.
Data is replicated multiple times on the system to increase availability and reliability.
When data is loaded into the system, the input file is split into ‘blocks’, typically 64MB or 128MB each.
Map tasks generally work on relatively small portions of data, typically a single block (a minimal WordCount sketch follows this list).
A master program allocates work to nodes such that a map task will work on a block of data stored locally on that node whenever possible.
Nodes work in parallel, each on its own part of the dataset.
If a node fails, the master will detect that failure and re-assign the work to some other node on the system.
Restarting a task does not require communication with nodes working on other portions of the data.
If a failed node restarts, it is automatically added back to the system and will be assigned new tasks.
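To make the map/reduce idea above concrete, here is a minimal sketch of the classic WordCount map and reduce functions using the org.apache.hadoop.mapreduce API (this example is illustrative and not part of the original post). Each map task is handed one input split, usually a single local block, and the framework routes all counts for a given word to one reduce call.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Called once per input line of the block/split this map task was assigned.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Receives every count emitted for a given word, from whichever nodes produced them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }
}

Notice that neither class mentions hosts, blocks or failures; the framework handles scheduling, data locality and re-running failed tasks.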
Hadoop Overview
Hadoop consists of two core components

HDFS
MapReduce
There are many other projects based around the core concepts of Hadoop. Collectively, these projects are called the Hadoop Ecosystem.

Hadoop Ecosystem has

Pig
Hive
Flume
Sqoop
Oozie
and so on…

A set of machines running HDFS and MapReduce is known as a Hadoop cluster, and the individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand nodes; as the number of nodes increases, performance increases. Hadoop Streaming is the facility that lets MapReduce code be written in languages other than Java (C++, Ruby, Python, Perl, etc.).
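For completeness, here is a minimal sketch (not from the original post) of a driver class that packages the WordCount mapper and reducer sketched earlier and submits them to such a cluster as a job. The HDFS input and output paths are hypothetical and would normally be passed as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths used only for this example.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submit the job and wait; the scheduler tries to run each map task
        // on a node that holds a local copy of its input block.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}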

HADOOP S/W AND H/W Requirements

Hadoop usually runs on open-source operating systems (Linux distributions such as Ubuntu); CentOS/RHEL is mostly used in production.

If we have Windows, it requires virtualization software to run another OS on Windows, e.g. VMware Player/VMware Workstation/VirtualBox.

Java is a prerequisite for Hadoop installation.

For more info: https://tekslate.com/hadoop-training/
