Install Hadoop on Mac – Ultimate Step by Step Guide

Posted by Marta on October 2, 2021

Learn how to install Hadoop on your Mac, including a brief reminder of what Hadoop is and how its architecture works.

Hey there! In this tutorial you will learn how to install Hadoop on your Mac, after a brief reminder of what Hadoop is and how its architecture works.

When talking about Hadoop, the first idea that probably crosses your mind is big data. Hadoop emerged alongside big data because there was a need to store massive amounts of data, and not only to store it, but also to analyse and access it in a reliable, scalable and affordable manner.

Hadoop solves two key big data problems. The first: what if one of the computers fails? Traditionally, if a machine fails, all the information stored on it is lost unless there is a backup. Hadoop has mechanisms to avoid this problem.

The second problem is combining information from different hard drives. When you are analysing large amounts of data saved across many hard drives, accessing and combining this information can be challenging. Fortunately, Hadoop tackles this issue as well.

What is Hadoop?

Hadoop is open source software optimised for reliable and scalable distributed computing. What does that mean? Distributed computing means that instead of a single computer carrying out a processing task, the task is performed by several machines: multiple computers, all connected together, working towards one goal.

Hadoop includes mechanisms that avoid data loss. It is also a scalable system, meaning more computers can be added to the cluster as the data grows.

Hadoop is designed to handle large files. Once you store a file in Hadoop, the file is split into smaller pieces, called blocks, and the blocks are spread across the machines in the cluster. In addition, each block is replicated on several machines to avoid data loss.

The whole system can be scaled up from one server to thousands of servers. The computation and storage power of every server is combined, resulting in a really powerful system.

Hadoop architecture

Hadoop is divided into two main parts: MapReduce and HDFS. MapReduce is a processing paradigm.

The MapReduce paradigm consists of splitting the input into fixed-size chunks and assigning one chunk, typically a file block, to each processing unit (the map step). Each processing unit is independent, may execute on a different server and accesses a different block. After that, the results from the independent processes are combined (the reduce step).
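
To get a feel for the idea, the same stages can be imitated on a single machine with ordinary Unix tools. This is only an analogy under my own assumptions, not Hadoop code: tr plays the map step by emitting one word per line, sort plays the shuffle that brings identical keys together, and uniq -c plays the reduce step. The word_count function name and the sample text are made up for the illustration.

```shell
# Word count as a map -> shuffle -> reduce pipeline (an analogy, not Hadoop itself).
# word_count is a hypothetical helper name; it reads text on stdin.
word_count() {
  tr -s '[:space:]' '\n' |   # map: emit one word per line
    grep -v '^$' |            # drop empty records
    sort |                    # shuffle: bring identical keys together
    uniq -c |                 # reduce: count each group of identical keys
    awk '{ print $2, $1 }'    # tidy the output as "word count"
}

printf 'big data\nbig cluster\n' | word_count
```

On a real cluster each stage would run in parallel on many machines, but the flow of data is roughly the same.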

HDFS is a storage layer that breaks large files into smaller blocks (128 MB by default in Hadoop 2.x and later). These blocks are distributed across different computers and also replicated. Therefore you might have the same block on three different servers, which prevents data loss.
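
To make the arithmetic concrete, here is a small sketch. It assumes a 128 MB block size (the default in recent Hadoop releases) and three replicas as in the example above. The hdfs_blocks function and the 300 MB file are made up for illustration; this is not a Hadoop command.

```shell
# Rough HDFS storage arithmetic; hdfs_blocks is a hypothetical helper.
# Usage: hdfs_blocks FILE_SIZE_MB [BLOCK_SIZE_MB] [REPLICATION]
hdfs_blocks() {
  size_mb=$1
  block_mb=${2:-128}     # assumed block size in MB
  replication=${3:-3}    # assumed replication factor
  # Ceiling division: the last block may be only partly filled.
  blocks=$(( (size_mb + block_mb - 1) / block_mb ))
  echo "blocks=$blocks copies=$(( blocks * replication ))"
}

hdfs_blocks 300
```

A 300 MB file needs three blocks (two full ones and a partial one), and with three replicas the cluster ends up holding nine block copies in total.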

In the HDFS architecture there are two types of nodes, or servers: name nodes and data nodes. There is usually one active name node, which keeps track of all the blocks and where they live. The data nodes contain the actual blocks. Once a client application has located a block through the name node, it talks to the data nodes directly to read the file.

Different Installation Modes

In this tutorial you will learn how to install Hadoop on a single machine. There are three different Hadoop configuration modes:

  • Standalone: in this mode no separate daemons are running. Everything runs in a single JVM on a single server. This is the default configuration.
  • Pseudo-distributed mode: the Hadoop daemons run and communicate as they would in a cluster, but they all run on the same server.
  • Fully distributed mode: the Hadoop daemons run on a cluster of machines.

There are basically three types of daemons in Hadoop: the HDFS daemons, which are the namenode, the secondary namenode and the datanodes; the YARN daemons, which are the resource manager and the node manager; and the MapReduce daemon, which is the job history server.

Next you will see what you need to do to install Hadoop in standalone mode, and also in pseudo-distributed mode.

Standalone installation

As we have seen, you can install Hadoop on a Mac in three different modes. One of them is standalone. Standalone means that there are no separate daemons: all Hadoop processes run in the same JVM. See below the steps to install Hadoop in standalone mode:

Step 1) Check Java is installed

Hadoop is written in Java and runs on the JVM. Therefore the first thing to do is to tell Hadoop where Java is installed. To do so, set the JAVA_HOME environment variable on your machine, adjusting the path to match the JDK you have installed:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-14.0.2.jdk/Contents/Home/

Then check that Java is installed by running the following command (on Java 8 and earlier the flag is spelled java -version, with a single dash):

>> java --version
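
If you would rather not hardcode the JDK path, macOS ships a small tool, /usr/libexec/java_home, that prints the home of the default installed JDK. The sketch below is one way to use it. The find_java_home function name and its optional argument are my own invention, and the snippet falls back to whatever JAVA_HOME already holds when the tool is missing.

```shell
# find_java_home is a hypothetical helper: it prints the JDK home reported by
# macOS's java_home tool, or falls back to the current $JAVA_HOME.
find_java_home() {
  helper=${1:-/usr/libexec/java_home}   # optional override, mainly for testing
  if [ -x "$helper" ] && jdk=$("$helper" 2>/dev/null); then
    echo "$jdk"
  else
    echo "${JAVA_HOME:-}"
  fi
}

# Export the detected value only when something was found.
detected=$(find_java_home)
if [ -n "$detected" ]; then
  export JAVA_HOME="$detected"
fi
echo "JAVA_HOME=${JAVA_HOME:-not set}"
```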

Step 2) Download hadoop

You can download Hadoop from the following website: Download Hadoop

Select any of the mirrors, then pick a version and download the file called hadoop-X.Y.Z.tar.gz.

Then uncompress the file in any folder. Make a note of where you uncompress it, since you will need the path in the next step.

The uncompressed directory contains all the scripts needed to run the Hadoop daemons.
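
From the terminal, uncompressing could look like the sketch below. The unpack_hadoop function name is made up, and the ~/Downloads and ~/sw folders in the commented example are only assumptions about where you keep things.

```shell
# unpack_hadoop is a hypothetical helper: extract a downloaded tarball into a target folder.
# Usage: unpack_hadoop TARBALL DEST_DIR
unpack_hadoop() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Example, assuming the tarball sits in ~/Downloads and ~/sw is where you keep software:
# unpack_hadoop ~/Downloads/hadoop-X.Y.Z.tar.gz ~/sw
```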

Step 3) Set up Hadoop environment variables

The next step to install Hadoop on a Mac is creating the Hadoop environment variables. You need to create the HADOOP_HOME environment variable, which points to the directory where you uncompressed the file in the previous step. You should also add its bin and sbin folders to your PATH variable. These are the commands you should execute:

>> export HADOOP_HOME=~/sw/hadoop-x.y.z
>> export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

You can add these two commands to your .bash_profile (or to .zshrc if your terminal runs zsh, the default shell on recent macOS versions) so that the Hadoop environment variables are set up permanently.
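
If you script this step, it helps to avoid appending the same export twice. Here is a minimal sketch of one way to do that. The add_line_once function is a made-up helper, and the profile file and Hadoop path in the commented example should be adjusted to your own setup.

```shell
# add_line_once is a hypothetical helper: append a line to a file only if it is not there yet.
# Usage: add_line_once LINE FILE
add_line_once() {
  line=$1
  file=$2
  touch "$file"
  # -x matches whole lines only, -F treats the line as plain text rather than a regex.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (single quotes keep the variables unexpanded until the profile is loaded):
# add_line_once 'export HADOOP_HOME=$HOME/sw/hadoop-x.y.z' ~/.bash_profile
# add_line_once 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' ~/.bash_profile
```

Open a new terminal window afterwards so the updated profile is loaded.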

This is all you need to do to install Hadoop in standalone mode. You can check that Hadoop is correctly installed by executing the following command from your terminal:

>> hadoop version

Output:

Hadoop 2.10.0
Subversion ssh://git.corp.linkedin.com:29418/hadoop/hadoop.git -r e2f1f118e465e787d8567dfa6e2f3b72a0eb9194
Compiled by jhung on 2019-10-22T19:10Z
Compiled with protoc 2.5.0
From source with checksum 7b2d8877c5ce8c9a2cca5c7e81aa4026
This command was run using /Users/martarey/dev/hadoop-2.10.0/share/hadoop/common/hadoop-common-2.10.0.jar
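
To go one step further and confirm that standalone mode can actually run a job, you can try the grep example that is bundled with the Hadoop distribution. The wrapper below is only a sketch: run_grep_example is a made-up name, it assumes HADOOP_HOME points at the uncompressed directory from step 3, and it works in a temporary folder so it does not touch your files.

```shell
# run_grep_example is a hypothetical wrapper around the example job bundled with Hadoop.
# It assumes HADOOP_HOME points at the unpacked distribution.
run_grep_example() {
  # Locate the examples jar; the version in its name depends on your download.
  set -- "${HADOOP_HOME:-}"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
  [ -f "$1" ] || { echo "examples jar not found under HADOOP_HOME" >&2; return 1; }
  workdir=$(mktemp -d) || return 1
  mkdir "$workdir/input"
  cp "$HADOOP_HOME"/etc/hadoop/*.xml "$workdir/input"
  # Count every match of the regular expression in the copied config files.
  "$HADOOP_HOME/bin/hadoop" jar "$1" grep "$workdir/input" "$workdir/output" 'dfs[a-z.]+' &&
    cat "$workdir/output"/*
}

# Run it once Hadoop is installed:
# run_grep_example
```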

If you would like to run Hadoop in pseudo-distributed mode, so that your local installation is closer to a fully distributed one, keep reading. There are a few more steps to carry out.

Pseudo-distributed installation

To run Hadoop in pseudo-distributed mode, there are some extra steps to carry out. In this mode, the different processes run locally but behave as if they were executing in a distributed system. This means that Hadoop assumes it needs to connect to a remote server to start the daemons, even though it actually connects to localhost (your machine).

Step 4) Configuration files

There are five properties that you have to modify to get Hadoop to work in pseudo-distributed mode:

  • fs.defaultFS : a property of the common component that sets the default filesystem.
  • dfs.replication : the factor by which Hadoop replicates file blocks.
  • yarn.resourcemanager.hostname : the machine that acts as the resource manager.
  • yarn.nodemanager.aux-services : the list of auxiliary services run by the node manager.
  • mapreduce.framework.name : the framework used to run MapReduce jobs.

These properties are set in XML files. You can find these files within the Hadoop directory, in the etc/hadoop folder. Here are the files that you should modify and the content of each:

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
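
After editing the four files, it is easy to lose track of which value went where. The sketch below reads a property back from one of them. The get_hadoop_prop function is a made-up helper that assumes one tag per line, as in the files above; it is not a general XML parser.

```shell
# get_hadoop_prop is a hypothetical helper: print the <value> that follows a given <name>.
# Usage: get_hadoop_prop FILE PROPERTY_NAME
get_hadoop_prop() {
  awk -v key="$2" '
    # Remember that we have seen the property we are looking for.
    index($0, "<name>" key "</name>") { found = 1; next }
    # The next <value> line belongs to that property: strip the tags and print it.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Example, assuming HADOOP_HOME is set as in step 3:
# get_hadoop_prop "$HADOOP_HOME/etc/hadoop/core-site.xml" fs.defaultFS
```

Once Hadoop is on your PATH, hdfs getconf -confKey fs.defaultFS reports the value that Hadoop itself resolves, which is a good cross-check.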

Step 5) Configure SSH

In pseudo-distributed mode, Hadoop acts as if it were running in a cluster. Therefore, to start the daemons, Hadoop connects to the different servers in the cluster using SSH, even though in pseudo-distributed mode all daemons run locally.

As a result, you need to make sure that SSH is available on your machine and enable passwordless login. On a Mac, the SSH server is controlled by the Remote Login setting. Go to System Preferences -> Sharing and make sure Remote Login is enabled.

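Next, run the following commands to create an SSH key and register it as authorized. This way no password is necessary when connecting over SSH. If you already have a key at ~/.ssh/id_rsa, skip the first command and reuse the existing key.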

>> ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
>> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now you can check that SSH is configured correctly by running the following command, which checks that you can connect to localhost without entering a password:

>> ssh localhost
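
If ssh localhost still asks for a password, one common cause is that the permissions on ~/.ssh are too open, in which case the SSH server may ignore authorized_keys. The sketch below tightens them. The secure_ssh_dir function is a made-up helper, and the commented line shows a non-interactive way to repeat the check.

```shell
# secure_ssh_dir is a hypothetical helper: tighten permissions so the SSH server accepts the key.
# Usage: secure_ssh_dir [SSH_DIR]   (defaults to ~/.ssh)
secure_ssh_dir() {
  dir=${1:-$HOME/.ssh}
  [ -d "$dir" ] || return 1
  chmod 700 "$dir"                       # only the owner may enter the folder
  if [ -f "$dir/authorized_keys" ]; then
    chmod 600 "$dir/authorized_keys"     # only the owner may read or write the key list
  fi
}

# Non-interactive check: BatchMode makes ssh fail instead of asking for a password.
# secure_ssh_dir && ssh -o BatchMode=yes localhost true && echo "passwordless ssh works"
```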

Step 6) Formatting the HDFS Filesystem

Before you can use HDFS, you need to format the filesystem. Formatting creates a new, empty filesystem, so it is something you do once on a fresh installation. Run the following command from your terminal:

>> hdfs namenode -format

Step 7) Start the daemons

The final step is starting the HDFS, YARN and MapReduce daemons. You can start them by running the following commands from your terminal:

>> start-dfs.sh
>> start-yarn.sh
>> mr-jobhistory-daemon.sh start historyserver

If the commands are not recognised, go back to step 3 and make sure you added Hadoop's sbin directory to your PATH variable:

 export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

You can check that the daemons are running correctly by opening the Hadoop web UIs in your browser:

  • NameNode Web UI: http://localhost:50070 (in Hadoop 3.x the port changed to 9870)
  • ResourceManager Web UI: http://localhost:8088
  • History Server UI: http://localhost:19888
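
You can also check from the terminal. The JDK ships a tool called jps that lists the running Java processes, and each Hadoop daemon shows up under its class name. The sketch below compares that list against the daemons started in this step. The missing_daemons function is a made-up helper, and the expected names are my assumption of what a pseudo-distributed setup like this one runs.

```shell
# missing_daemons is a hypothetical helper: read `jps` output on stdin and print
# every expected Hadoop daemon that does not appear in it.
missing_daemons() {
  running=$(awk '{ print $2 }')   # second column of jps output is the class name
  for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager JobHistoryServer; do
    printf '%s\n' "$running" | grep -qxF "$daemon" || echo "$daemon"
  done
}

# Only call jps when it is available (it ships with the JDK).
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

If the command prints nothing, all six daemons are up. Any name it does print is a daemon worth looking up in the logs folder of your Hadoop directory.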

The following commands will stop the daemons:

>> stop-dfs.sh
>> stop-yarn.sh
>> mr-jobhistory-daemon.sh stop historyserver

Perfect! You have installed Hadoop on your machine. Installing Hadoop manually is a good way to get a better understanding of the different components that run in a Hadoop system.

However, if you are planning to work a lot with Hadoop, my advice would be to install it in a virtual machine so that it is isolated. This way you avoid breaking your Hadoop environment by changing environment variables while working on something else. In the next section, you will learn how to install Hadoop in VirtualBox.

Install Hadoop in VirtualBox

Installing Hadoop in VirtualBox on a Mac is another good way to use Hadoop. The advantage of a virtual machine is that the environment is isolated, so you won't break it by updating the Java version, for instance. Plus, Cloudera provides a Hadoop distribution that includes not only Hadoop but also the other components you will need when working with it: Apache Avro, Apache Crunch, Apache DataFu, Apache Flume, Apache HBase, Apache Hive, Hue, Apache Mahout, Apache Zookeeper and a few more.

Step 1) Install VirtualBox

First you need to install VirtualBox on your machine. In this instance, we will use a Linux environment. To do so, head to:

https://www.virtualbox.org/

Then click on Download and select your operating system.

Step 2) Download Hadoop image

Next, you need to download the “Hortonworks Sandbox”, which is basically a Hadoop image. To do so, go to the following page:

https://www.cloudera.com/downloads/hortonworks-sandbox.html

Download the “Hortonworks HDP” and select “Virtual Box” as the installation type.

This will trigger the image download. The Hadoop image is downloaded as a .ova file.

Next you need to import the image into VirtualBox. To start the environment, just select it and click “Start”. Once the virtual machine starts, you will have a preinstalled Hadoop environment with a bunch of associated technologies, like Hive, Spark and more.
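
If you prefer the terminal to the VirtualBox window, the import can also be done with VBoxManage, the command-line tool installed with VirtualBox. The sketch below is one way to wrap it. The import_sandbox function name is made up, and the path to the .ova file is whatever your browser saved.

```shell
# import_sandbox is a hypothetical helper wrapping VirtualBox's command-line tool.
# Usage: import_sandbox PATH_TO_OVA
import_sandbox() {
  if ! command -v VBoxManage >/dev/null 2>&1; then
    echo "VBoxManage not found; is VirtualBox installed?" >&2
    return 1
  fi
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  # Import the appliance, then list the registered machines to find its name.
  VBoxManage import "$1" && VBoxManage list vms
}

# Example (replace the path with your downloaded file):
# import_sandbox ~/Downloads/your-sandbox-file.ova
```

The name printed by the last command is the one you would pass to VBoxManage startvm to boot the machine without opening the window.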

Conclusion

To summarise, we have seen that Hadoop is optimised for dealing with big data. The Hadoop ecosystem has several components, like HDFS and MapReduce, which make the platform reliable and scalable. Additionally, we have seen how to install Hadoop on a Mac manually and how to install it in VirtualBox.
