Posted by Marta on February 2, 2023
This article is a complete Apache Spark Java tutorial, where you will learn how to write a simple Spark application. No previous knowledge of Apache Spark is required to follow this guide. Our Spark application will find out the most popular words in US YouTube video titles.
First, I will introduce Apache Spark: its history, what it is, and how it works. Then you will see how to write a simple Spark application.
Apache Spark's history began at UC Berkeley, where a group of researchers recognized the lack of interactivity of MapReduce jobs. Depending on the dataset's size, a large MapReduce job could take hours or even days to complete. Additionally, the whole Hadoop and MapReduce ecosystem was complicated and challenging to learn.
The Apache Hadoop framework was an excellent solution for distributed systems, introducing a parallel programming paradigm over distributed datasets. The worker nodes of a cluster execute computations and aggregate results, producing an outcome. However, Hadoop has a few shortcomings: the system involves an elaborate set-up and is not very interactive. Fortunately, Apache Spark brought simplicity and speed to the picture.
Apache Spark is a computational engine that can schedule and distribute an application's computation, which consists of many tasks. This means your computation tasks or application won't execute sequentially on a single machine. Instead, Apache Spark splits the computation into separate smaller tasks and runs them on different servers within the cluster, maximizing the power of parallelism.
Another critical improvement over Hadoop is speed. Using in-memory storage for intermediate computation results makes Apache Spark much faster than Hadoop MapReduce.
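To give these two ideas some concrete shape before we build the real example, here is a minimal sketch. The class name and numbers are made up for illustration, and it only needs the same spark-core dependency we add later in this tutorial. It parallelizes a small collection across local cores and caches an intermediate RDD so that two actions can reuse it without recomputing:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class ParallelismSketch {
    public static void main(String[] args) {
        // Run the driver and executors inside this single JVM, using all available cores
        SparkConf conf = new SparkConf().setAppName("parallelismSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Spark splits this collection into partitions and processes them in parallel
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        JavaRDD<Integer> squares = numbers.map(n -> n * n);

        // Keep the intermediate result in memory so both actions below can reuse it
        squares.cache();

        System.out.println("count = " + squares.count());
        System.out.println("sum   = " + squares.reduce(Integer::sum));

        sc.close();
    }
}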
Apache Spark uses a master-slave architecture, meaning one node coordinates the computations that will execute in the other nodes.
The master node is the central coordinator, which runs the driver program. The driver program splits a Spark job into smaller tasks and executes them across many distributed workers. The driver program communicates with the distributed worker nodes through a SparkSession.
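For reference, creating such a session in Java looks roughly like the sketch below. The class name is hypothetical, and note that SparkSession lives in the spark-sql module, while the word-count example later in this tutorial sticks to the lower-level JavaSparkContext from spark-core:

import org.apache.spark.sql.SparkSession;

public class DriverSketch {
    public static void main(String[] args) {
        // The driver program owns this session and uses it to schedule tasks on the executors
        SparkSession spark = SparkSession.builder()
                .appName("driverSketch")
                .master("local[*]") // a cluster URL would go here when running on real workers
                .getOrCreate();

        System.out.println("Spark version: " + spark.version());

        spark.stop();
    }
}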
There are different ways to install and execute a Spark application, using different configurations. You could configure Spark to run the driver program and executors in the same single JVM on a laptop, in different JVMs, or in different JVMs across a cluster.
In this tutorial, we will use the local configuration, which means, as mentioned before, that the driver program, Spark executors, and cluster manager all run in the same JVM.
Before you start, you will need to check that you have Java, version 8 or higher, and Maven installed on your machine. Check by opening the terminal and executing the following commands:
>> java -version
>> mvn -version
We will use Java to write our Spark job example and Maven to manage our program's libraries. In this case, Spark Core is the main dependency to download.
Once you have checked that Java and Maven are installed on your laptop, the first step is creating the project where our code will go. To do so, we will use Maven. Create the project by opening your terminal and running the following command:
>> mvn archetype:generate -DgroupId=com.example.app -DartifactId=spark-example -DarchetypeArtifactId=maven-archetype-quickstart
The previous command will create a new directory containing an empty project. Next, we will need to modify some of the files, so you will need to open this project with your preferred IDE. In my case, I will be using IntelliJ IDEA, a great IDE that I fully recommend.
At this point, we have created the project and opened it. Next, you will need to include the Spark dependency in your project to get access to the Spark functionality. You can do so by opening the pom.xml file and adding the following within the <dependencies> tag:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.1</version>
</dependency>
Adding the previous snippet tells Maven to pull all the dependencies necessary to write a Spark application.
Now that you have set up your project, we will need a dataset to analyze. In this case, we will use a dataset that contains US YouTube videos and information like the number of views, likes, dislikes, etc. However, since we want to know the most popular title words, we will focus on the title.
To download the dataset, go to the www.kaggle.com site using the link below, select the USvideos.csv file, and click on download:
Kaggle is a site that contains a vast number of datasets, convenient for experimenting with big data and machine learning. It is entirely free to use.
Once you have downloaded the USvideos.csv file, save it in a directory within your project. In my case, I have kept it in the directory data/youtube:
And finally, we arrive at the last step of the Apache Spark Java tutorial: writing the code of the Apache Spark Java program. So far, we have created the project and downloaded a dataset, so you are ready to write a Spark program that analyses this data. Specifically, we will find out the most frequently used words in trending YouTube titles.
To create the Spark application, I will make a class called YoutubeTitleWordCount and add our code within the main method. Our code will carry out the following steps: create the Spark context, load the dataset, extract and clean the video titles, split the titles into words, count how often each word appears, and display the results sorted by frequency.
Here is how to convert these steps into Java code:
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class YoutubeTitleWordCount {

    private static final String COMMA_DELIMITER = ",";

    public static void main(String[] args) throws IOException {
        // Keep the console output readable by silencing Spark's INFO logging
        Logger.getLogger("org").setLevel(Level.ERROR);

        // CREATE SPARK CONTEXT
        SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // LOAD DATASETS
        JavaRDD<String> videos = sparkContext.textFile("data/youtube/USvideos.csv");

        // TRANSFORMATIONS
        JavaRDD<String> titles = videos
                .map(YoutubeTitleWordCount::extractTitle)
                .filter(StringUtils::isNotBlank);

        JavaRDD<String> words = titles.flatMap(title -> Arrays.asList(title
                .toLowerCase()
                .trim()
                .replaceAll("\\p{Punct}", "")
                .split(" ")).iterator());

        // COUNTING
        Map<String, Long> wordCounts = words.countByValue();
        List<Map.Entry<String, Long>> sorted = wordCounts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .collect(Collectors.toList());

        // DISPLAY (least frequent first, so the most popular words appear at the end)
        for (Map.Entry<String, Long> entry : sorted) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    public static String extractTitle(String videoLine) {
        // Naive CSV split on commas: the title is the third column of USvideos.csv
        try {
            return videoLine.split(COMMA_DELIMITER)[2];
        } catch (ArrayIndexOutOfBoundsException e) {
            return "";
        }
    }
}
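If you want to try the program straight away, one option, and this is just a suggestion on my part rather than part of the original project set-up, is the Maven exec plugin. Adjust -Dexec.mainClass to the fully qualified class name if you placed the class inside a package:

>> mvn compile exec:java -Dexec.mainClass=YoutubeTitleWordCount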
When creating the SparkConf, you indicate the configuration to use, which in this case is local, using 3 CPU cores:
SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
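For reference, setMaster accepts other master URLs as well. The lines below are only an illustration, and the host name is a placeholder:

// Run everything in a single JVM, using as many worker threads as there are cores
new SparkConf().setAppName("wordCounts").setMaster("local[*]");

// Submit to a standalone Spark cluster (placeholder host, default port 7077)
new SparkConf().setAppName("wordCounts").setMaster("spark://master-host:7077");

// Let YARN manage the cluster resources (usually passed via spark-submit rather than hard-coded)
new SparkConf().setAppName("wordCounts").setMaster("yarn");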
Another important aspect is the RDD concept. RDD stands for Resilient Distributed Dataset, and it is simply an encapsulation of a large dataset spread across the cluster. In Spark, all work is expressed through RDDs, either by applying transformations such as filters or by calling actions to compute results.
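To make that distinction concrete, here is a small self-contained sketch. The class name is hypothetical and it reuses the dataset path from this tutorial; the transformations only describe the computation and are lazy, while the count actions trigger the actual work:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddLazinessSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rddLazinessSketch").setMaster("local[3]");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // Transformations are lazy: these two lines only describe the computation
        JavaRDD<String> lines = sparkContext.textFile("data/youtube/USvideos.csv");
        JavaRDD<String> longLines = lines.filter(line -> line.length() > 100);

        // Actions trigger the actual work and bring results back to the driver
        long total = lines.count();
        long matching = longLines.count();
        System.out.println(matching + " of " + total + " lines are longer than 100 characters");

        sparkContext.close();
    }
}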
Check out the source code here
This article was an Apache Spark Java tutorial to help you get started with Apache Spark. Apache Spark is a distributed computing engine that makes computation over extensive datasets easier and faster by taking advantage of parallelism and distributed systems. Plus, we have seen how to create a simple Apache Spark Java program.
I hope you enjoy this article, and thank you so much for reading and supporting this blog!