Posted by Marta on March 27, 2022 Viewed 9157 times
Hey there! In this tutorial you will learn how to install the hadoop system on your mac machine, including a brief reminder of what hadoop is and its architecture.
When talking about hadoop, the first idea that probably crosses your mind is big data. Hadoop emerged alongside big data, as there was a need to store massive amounts of data. And not only to store it, but to analyse it and access it in a reliable, scalable and affordable manner.
The hadoop system solves two key big data problems. First problem: what if one of the computers fails? Traditionally, if a machine fails, all the information stored on it is lost, unless there is a backup. The hadoop system has mechanisms to avoid this problem.
The second challenging problem was combining information from different hard drives. When you are analysing large amounts of data spread across many hard drives, accessing and combining that information can be a challenge. Fortunately, Hadoop also tackles this issue.
Watch this tutorial on Youtube
Hadoop is open source software optimised for reliable and scalable distributed computing. What does that mean? Distributed computing means that instead of a single computer carrying out a processing task, the task is performed by several machines: multiple computers, all connected together, working towards one goal.
The Hadoop software includes mechanisms that avoid data loss. And it is a scalable system, meaning more computers can be added to the system as the data grows.
Hadoop is designed to handle large files. Once you store a file in Hadoop, the file is split into smaller pieces, and each piece is stored on a different machine within the cluster. Plus, each file block is replicated on several machines to avoid data loss.
The whole system can be scaled up from one server to thousands of servers. As a result, the computation and storage power of each server is combined, resulting in a really powerful system.
The hadoop system is divided in two parts: MapReduce and HDFS. MapReduce is a processing paradigm.
The map reduce paradigm consists of splitting the processing work into fixed size chunks and assigning a file block to each processing unit. Each processing unit is independent, and might execute on a different server and access a different block. After that, the results from the independent processes are combined.
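As a loose analogy (this is plain unix, not hadoop itself), the classic word count pipeline below mimics the paradigm on a single machine: tr plays the map step by emitting one record per word, sort groups identical words together like the shuffle, and uniq -c reduces each group to a count.

```shell
# "map": split the input into one word per line
# "shuffle": sort brings identical words together
# "reduce": uniq -c collapses each group into a count
echo "big data big cluster data big" | tr ' ' '\n' | sort | uniq -c | sort -rn
# prints the counts in descending order: 3 big, 2 data, 1 cluster
```

In real map reduce the same three stages run, only spread across many servers and many file blocks instead of one pipe.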
HDFS is a storage paradigm which breaks large files into smaller blocks (128 MB by default). These blocks are distributed across different computers, and also replicated. Therefore you might have the same block on three different servers, which prevents data loss.
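To make the numbers concrete, here is a small back-of-the-envelope calculation. The 128 MB block size and a replication factor of 3 are the usual hadoop defaults; the 400 MB file is just an example:

```shell
# A 400 MB file is split into ceil(400/128) = 4 blocks;
# with replication factor 3 the cluster stores 12 block copies in total
file_mb=400
block_mb=128
replication=3
blocks=$(( (file_mb + block_mb - 1) / block_mb ))   # ceiling division
copies=$(( blocks * replication ))
echo "$blocks blocks, $copies copies"               # prints: 4 blocks, 12 copies
```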
In the HDFS architecture there are two types of nodes or servers: the name nodes and the data nodes. There is usually one active name node, which keeps track of all the blocks and where those blocks live. The data nodes contain the actual blocks, and they are what the client application talks to in order to access a file, once it has located the block.
In this tutorial you will learn how to install hadoop on a single machine. There are three different hadoop configuration modes: standalone mode, pseudo distributed mode and fully distributed mode.
There are basically three types of daemons in hadoop: the HDFS daemons, which are the namenode, the secondary namenode and the datanodes; the YARN daemons, which are the resource manager and the node managers; and the mapreduce job history daemon.
Next you will see what you need to do to install hadoop in standalone mode, and also in pseudo distributed mode.
As we have seen, you can install hadoop on mac in three different modes. One of them is standalone. Standalone means that there are no separate daemons; all hadoop processes run in the same JVM. See below the steps to install hadoop in standalone mode:
Hadoop is written in Java, and is using Java behind the scenes. Therefore the first thing to do is to indicate to hadoop where Java is installed. To do so, you need to set up the JAVA_HOME environment variable on your machine.
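For example, on a mac you can point JAVA_HOME at the installed JDK using the java_home helper that ships with macOS; the fallback path below is just a placeholder, adjust it to wherever your JDK actually lives:

```shell
# macOS ships /usr/libexec/java_home, which prints the path of the
# installed JDK; on other systems set the path by hand
if [ -x /usr/libexec/java_home ]; then
  export JAVA_HOME="$(/usr/libexec/java_home)"
else
  export JAVA_HOME="$HOME/jdk"   # placeholder path: adjust to your install
fi
echo "JAVA_HOME is $JAVA_HOME"
```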
Then check that java is installed, by running the following command:
>> java --version
You can download hadoop from the following website: Download Hadoop
Select any of the mirrors, then pick a version and download the file called hadoop-X.Y.Z.tar.gz.
Then you need to uncompress the file into any folder. Make a note of where you uncompressed it, since you will need the path for the next step.
This directory contains all the necessary scripts for running the hadoop daemons.
The next step to install hadoop on mac is creating the hadoop environment variables. You will need to create the HADOOP_HOME environment variable, which will point to the directory where you uncompressed the previous file. You should also add this path to your PATH variable. See below the commands you should execute:
>> export HADOOP_HOME=~/sw/hadoop-x.y.z
>> export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
You can include these commands in your .bash_profile so the hadoop environment variables are permanently set up.
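For instance, you can append both exports to ~/.bash_profile in one go (the ~/sw/hadoop-x.y.z path is the unpack location assumed in the previous step; replace it with yours):

```shell
# Append the hadoop variables to ~/.bash_profile so every new login
# shell picks them up automatically
cat >> ~/.bash_profile <<'EOF'
export HADOOP_HOME=~/sw/hadoop-x.y.z
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
# Reload the profile in the current shell
. ~/.bash_profile
```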
This is all you need to do to install hadoop in standalone mode. You can check that hadoop is correctly installed by executing the following command from your terminal:
>> hadoop version
Hadoop 2.10.0
Subversion ssh://git.corp.linkedin.com:29418/hadoop/hadoop.git -r e2f1f118e465e787d8567dfa6e2f3b72a0eb9194
Compiled by jhung on 2019-10-22T19:10Z
Compiled with protoc 2.5.0
From source with checksum 7b2d8877c5ce8c9a2cca5c7e81aa4026
This command was run using /Users/martarey/dev/hadoop-2.10.0/share/hadoop/common/hadoop-common-2.10.0.jar
If you would like to run hadoop in pseudo distributed mode, so that your local installation is closer to a fully distributed one, keep reading; there are a few more steps to carry out. In this mode, the different processes run locally, but behave as if executing in a distributed system. This means that hadoop will assume it needs to connect to a remote server to start the daemons, even though it will actually connect to localhost (your machine).
There are 5 properties that you have to modify to get hadoop to function in pseudo distributed mode. See the properties below:
fs.defaultFS: a core (common component) property that sets the default filesystem URI.
dfs.replication: indicates the factor by which hadoop replicates file blocks.
yarn.resourcemanager.hostname: determines the machine that is set as the resource manager. And yarn.nodemanager.aux-services: a list of the auxiliary services run by the node manager.
mapreduce.framework.name: determines the framework used to run mapreduce jobs; here it will be set to yarn.
These properties are set within xml files. You can find these files within the hadoop directory, in the
etc/hadoop folder. Here are the files that you should modify and the content of each:
hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
mapred-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
In pseudo distributed mode, hadoop acts as if running in a cluster. Therefore, to start the daemons, hadoop will connect to the different servers in the cluster using ssh, though in pseudo distributed mode all daemons will be running locally.
As a result, you need to make sure that ssh is installed on your machine, and enable passwordless login. On a mac, you can check that ssh is enabled in the Remote Login settings. Go to Preferences -> Sharing and make sure Remote Login is enabled:
Next, run the following commands to create an ssh key and add it to the authorized keys. This way no password is necessary when SSHing:
>> ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
>> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now you can check that you configured ssh correctly by running the following command, which checks that you can connect to localhost without entering any password:
>> ssh localhost
Before you can use HDFS, you will need to format the file system. You can do so by running the following command from your terminal:
>> hdfs namenode -format
And the final step is starting the HDFS, Yarn and MapReduce daemons. You can start them by running the following commands from your terminal:
>> start-dfs.sh
>> start-yarn.sh
>> mr-jobhistory-daemon.sh start historyserver
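Once the start scripts finish, you can confirm that the daemons are up with jps, a small JDK tool that lists running Java processes. The guard below just keeps the snippet from failing on machines without a JDK on the PATH:

```shell
# jps lists the running JVMs; in pseudo distributed mode you should see
# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
# and JobHistoryServer among them
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found: add your JDK bin directory to PATH"
fi
```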
In case the commands are not recognised, go back to step 3 and make sure you added the hadoop sbin directory to your PATH variable.
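A quick way to double check is to print PATH one entry per line and grep for the sbin directory (the ~/sw/hadoop-x.y.z location is the same assumption as in step 3):

```shell
# Re-add the hadoop bin and sbin directories in case they are missing,
# then confirm the sbin entry is present on the PATH
export PATH="$PATH:$HOME/sw/hadoop-x.y.z/bin:$HOME/sw/hadoop-x.y.z/sbin"
echo "$PATH" | tr ':' '\n' | grep "hadoop-x.y.z/sbin"
```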
You can check that the daemons are running and the hadoop ecosystem is working correctly by opening the hadoop web UIs in your browser. For Hadoop 2.x, the namenode UI is at http://localhost:50070, the resource manager UI at http://localhost:8088 and the job history UI at http://localhost:19888 by default.
The following commands will stop the daemons:
>> stop-dfs.sh
>> stop-yarn.sh
>> mr-jobhistory-daemon.sh stop historyserver
Perfect! You have installed hadoop on your machine. Installing hadoop manually is a good way to get a better understanding of the different components that are executing in a hadoop system.
However, in case you are planning to work a lot with hadoop, my advice would be to install it in a virtual machine, so it is isolated. This way you can avoid changing environment variables while doing something else and messing up your hadoop environment. In the next section, you will learn how to install hadoop in a virtual box.
Installing hadoop in a virtual box on mac is another great way to use hadoop. The advantage of using a virtual box is that the environment is isolated, so you won't unconfigure the environment by updating the java version, for instance. Plus, Cloudera provides a hadoop distribution that not only includes hadoop, but also all the other components that you will need when working with hadoop: Apache Avro, Apache Crunch, Apache DataFu, Apache Flume, Apache HBase, Apache Hive, Hue, Apache Mahout, Apache Zookeeper and a few more.
First you need to install Virtual Box on your machine. In this instance, we will use a linux environment. To do so, head to :
Then click on download, and select your operating system:
Next, you will need to download the "Hortonworks Sandbox", which is basically a hadoop image. To do so, go to the following page:
and download the "Hortonworks HDP", then select "Virtual Box" as the installation type.
This will trigger the image download; the hadoop image will be downloaded as a virtual appliance file.
Next you will need to import the image into Virtual Box. To start the environment, just select it and click "start". Once the virtual box starts, you will have a preinstalled hadoop environment with a bunch of associated technologies, like Hive, Spark and more.
To summarise, we have seen that hadoop is optimised for dealing with big data. The hadoop ecosystem has several components, like HDFS and MapReduce, which make the platform reliable and scalable. Additionally, we have seen how to install hadoop on mac manually, and how to install it in a virtual box.