Apache Spark Java Tutorial: Simplest Guide to Get Started

Posted by Marta on February 2, 2023

This article is a complete Apache Spark Java tutorial, where you will learn how to write a simple Spark application. No previous knowledge of Apache Spark is required to follow this guide. Our Spark application will find the most popular words in US YouTube video titles.

First, I introduce Apache Spark: its history, what it is, and how it works. Then you will see how to write a simple Spark application.

History of Apache Spark

Apache Spark emerged at UC Berkeley, where a group of researchers acknowledged the lack of interactivity of MapReduce jobs. Depending on the dataset’s size, a large MapReduce job could take hours or even days to complete. Additionally, the whole Hadoop and MapReduce ecosystem was complicated and challenging to learn.

The Apache Hadoop framework was an excellent solution for distributed systems, introducing a parallel programming paradigm over distributed datasets. The worker nodes of a cluster execute computations and aggregate results, producing an outcome. However, Hadoop has a few shortcomings: the system involves an elaborate set-up and is not very interactive. Fortunately, Apache Spark brought simplicity and speed to the picture.

What is Apache Spark?

Apache Spark is a computational engine that can schedule and distribute an application’s computation consisting of many tasks. This means your computation tasks or application won’t execute sequentially on a single machine. Instead, Apache Spark splits the computation into separate smaller tasks and runs them on different servers within the cluster, maximizing the power of parallelism.

Another critical improvement over Hadoop is speed. Using in-memory storage for intermediate computation results makes Apache Spark much faster than Hadoop MapReduce.

Architecture with examples

Apache Spark uses a master-slave architecture, meaning one node coordinates the computations that will execute in the other nodes.

The master node is the central coordinator and runs the driver program. The driver program splits a Spark job into smaller tasks and executes them across many distributed workers. The driver program communicates with the distributed worker nodes through a SparkSession (or, when using the RDD API as in this tutorial, a SparkContext).
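To make the idea of a driver program more concrete, here is a minimal sketch of a driver that does nothing except start and stop a session. The class name is made up for illustration, and SparkSession lives in the spark-sql module; the tutorial code further down uses the older RDD-oriented JavaSparkContext from Spark Core instead.

    import org.apache.spark.sql.SparkSession;

    public class ArchitectureSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("architecture-sketch") // name shown in the Spark UI
                    .master("local[*]")             // driver and executors share one local JVM
                    .getOrCreate();
            spark.stop();
        }
    }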

Set Up Spark Java Program

There are several ways to install and execute a Spark application using different configurations. You could configure Spark to run the driver program and executors in the same single JVM on a laptop, in different JVMs, or in different JVMs across a cluster.

In this tutorial, we will use the local configuration, which means, as mentioned before, that the driver program, the Spark executors, and the cluster manager all run in the same JVM.

Before you start, check that you have Java (version 8 or higher) and Maven installed on your machine. You can verify this by opening the terminal and executing the following commands:

>> java -version
>> mvn -version

We will use Java to write our Spark job example and Maven to manage the libraries of our program. In this case, Spark Core is the main dependency to download.

Step 1) Create the project

Once you have checked that Java and Maven are installed on your laptop, the first step is creating the project where our code will go. To do so, we will use Maven. Create the project by opening your terminal and running the following command:

>>mvn archetype:generate -DgroupId=com.example.app -DartifactId=spark-example -DarchetypeArtifactId=maven-archetype-quickstart

The previous command will create a new directory containing an empty project. Next, we will need to modify some of the files, so open this project with your preferred IDE. In my case, I will be using IntelliJ IDEA, which is a great IDE that I fully recommend.
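If everything went well, the generated project should look roughly like this (the exact files may vary slightly between archetype versions):

    spark-example/
    ├── pom.xml
    └── src/
        ├── main/java/com/example/app/App.java
        └── test/java/com/example/app/AppTest.java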

Step 2) Add Spark dependencies.

At this point, we have created a project and opened it. Next, you will need to include the Spark dependency in your project to get access to the Spark functionality. You can do so by opening the pom.xml file and adding the following within the <dependencies> tag:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.0.1</version>
    </dependency>

Adding the previous snippet tells Maven to pull all the necessary dependencies to write a Spark application.
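Depending on the archetype version, the generated pom.xml may target a very old Java version. If that is the case, you may also want to point the Maven compiler at Java 8 (or whichever version you installed) by adding or adjusting these properties in pom.xml:

    <properties>
      <maven.compiler.source>1.8</maven.compiler.source>
      <maven.compiler.target>1.8</maven.compiler.target>
    </properties>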

Step 3) Download the dataset.

Now that you have set up your project, we will need a dataset to analyze. In this case, we will use a dataset that contains US YouTube videos and information such as the number of views, likes, dislikes, etc. However, since we want to know the most popular title words, we will focus on the title field.

To download the dataset, go to the www.kaggle.com site using the link below, select the USvideos.csv file, and click on download:

Download the dataset

Kaggle is a site that contains a vast number of datasets, convenient for experimenting with big data and machine learning. It is entirely free to use.

Once you have downloaded the USvideos.csv file, save it in a directory within your project. In my case, I have kept it in the directory data/youtube.
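Each row of USvideos.csv is a comma-separated record, and the video title is the third field (index 2), which is what the code in the next section relies on. A simplified, entirely made-up example row could look like this:

    someVideoId,17.14.11,An Example Video Title,Some Channel,...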

Write an Apache Spark Java Program

And finally, we arrive at the last step of the Apache Spark Java tutorial: writing the code of the Apache Spark Java program. So far, we have created the project and downloaded a dataset, so you are ready to write a Spark program that analyses this data. Specifically, we will find out the most frequently used words in trending YouTube titles.

To create the Spark application, I will make a class called YoutubeTitleWordCount and add our code within the main method. Our code will carry out the following steps:

  • Create a Spark context, which is the entry point to the Spark Core functionality.
  • Load the dataset as an RDD. The RDD is the Spark Core abstraction for working with data.
  • Run some transformations: extract the titles, discard the other fields, lower-case the titles, remove any punctuation, and split them into words.
  • Count the word occurrences, sort them, and print them.

See how to convert these steps to Java code:

import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class YoutubeTitleWordCount {
    private static final String COMMA_DELIMITER = ",";
    public static void main(String[] args) throws IOException {

        Logger.getLogger("org").setLevel(Level.ERROR);
		// CREATE SPARK CONTEXT
        SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
      
        // LOAD DATASETS
        JavaRDD<String> videos = sparkContext.textFile("data/youtube/USvideos.csv");

        // TRANSFORMATIONS
        JavaRDD<String> titles = videos
                .map(YoutubeTitleWordCount::extractTitle)
                .filter(StringUtils::isNotBlank);

        JavaRDD<String> words = titles.flatMap( title -> Arrays.asList(title
                .toLowerCase()
                .trim()
                .replaceAll("\\p{Punct}","")
                .split(" ")).iterator());
		
        // COUNTING
        Map<String, Long> wordCounts = words.countByValue();
        List<Map.Entry<String, Long>> sorted = wordCounts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue()).collect(Collectors.toList());
      
        // DISPLAY
        for (Map.Entry<String, Long> entry : sorted) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }

    }


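    // Extracts the video title, which is the third comma-separated field in each CSV row.
    // A plain split on commas is a simplification: titles that contain commas themselves
    // (and are therefore quoted in the CSV) will not be parsed perfectly.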
    public static String extractTitle(String videoLine){
        try {
            return videoLine.split(COMMA_DELIMITER)[2];
        }catch (ArrayIndexOutOfBoundsException e){
            return "";
        }
    }
}

When creating the SparkConf, you indicate the configuration to use, which in this case is local, using 3 CPU cores:

 SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
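If you prefer Spark to use all the cores available on your machine rather than a fixed number, you can pass local[*] instead:

 SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[*]");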

Another important concept is the RDD. RDD stands for Resilient Distributed Dataset, and it is essentially an encapsulation of a large dataset distributed across the cluster. In Spark, all work is expressed through RDDs: either defining transformations, such as filters, or calling actions to compute results.
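As a tiny illustration of that idea (a sketch using a throwaway in-memory list rather than the YouTube file), transformations such as filter are lazy and only describe the computation, while actions such as count actually trigger it:

    JavaRDD<Integer> numbers = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4));
    JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0); // transformation: lazy
    long evenCount = evens.count();                           // action: runs the job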

Check out the source code here
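To run the program locally, one option (a sketch, assuming you created the class in the com.example.app package generated by the archetype and added the matching package declaration at the top of the file) is to compile and launch it straight from Maven:

>> mvn compile exec:java -Dexec.mainClass=com.example.app.YoutubeTitleWordCount

Alternatively, you can simply run the main method from your IDE.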

Conclusion

This article was an Apache Spark Java tutorial to help you get started with Apache Spark. Apache Spark is a distributed computing engine that makes large-scale dataset computation easier and faster by taking advantage of parallelism and distributed systems. Plus, we have seen how to create a simple Apache Spark Java program.

I hope you enjoy this article, and thank you so much for reading and supporting this blog!
