Spark, the cluster-computing framework from the Hadoop ecosystem, is fully supported in Stambia, as explained in the Stambia DI for Spark article.

You'll find below everything needed to begin using Spark in Stambia, from the Metadata creation to the first Mappings.

Prerequisites:

You must install the Hadoop connector to be able to work with Spark.

Please refer to the Getting started with Hadoop Connector article, which will guide you through this.

A note about Spark versions:
This article covers Spark 2 and later versions.
For Spark 1.6, dedicated Stambia templates are also available.
Even though Spark 1.6 and 2.x are different, the Stambia templates are close enough to be used the same way.

 

The first step when working with Spark in Stambia DI is to create and configure the Spark Metadata.

Here is a summary of the main steps to follow:

  1. Spark connector modes
  2. Spark metadata
    1. Spark metadata creation
    2. Server properties configuration
    3. HDFS metadata links
  3. Mapping examples
    1. Loading and integration
    2. Staging
    3. Common mapping parameters

 

Spark connector modes

In Stambia, there are two ways to use the Spark connector.
Both use the "spark-submit" command to run jobs, but they differ in where the Java code is compiled:

  • Local mode:
    The Java code generated by Stambia is compiled on the same server the Stambia Runtime is running on.
    Once compilation is done, the generated JARs and dependencies are sent to the Spark cluster and executed with "spark-submit".
    local connector

 

  • SSH mode:
    The Java code generated by Stambia and its dependencies are sent to the Spark cluster over SSH.
    The Java code is then compiled on the Spark cluster, also over SSH.
    Finally, "spark-submit" is executed to run the jobs.
    ssh connector

 

Spark metadata

Spark metadata can be used as a stage (between loading and integration) to boost Hadoop mappings.

Spark metadata creation

Here is an example of a common Metadata Configuration:


metadata

The main elements of this Spark Metadata are:

  • The Spark Master URL: the connection to the Spark cluster, depending on your context.
    It can take several forms: Kubernetes, local, YARN, Spark standalone, Mesos, ...
    metadata Spark URL
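
    For reference, here is a minimal Java sketch (plain Spark code, not Stambia-generated code) showing the usual Master URL formats; the host names and ports are placeholders.

    ```java
    import org.apache.spark.sql.SparkSession;

    public class MasterUrlExamples {
        public static void main(String[] args) {
            // Typical Master URL formats (hosts and ports below are placeholders):
            //   local[*]                          - run locally, using all available cores
            //   yarn                              - submit to a YARN cluster
            //   spark://master-host:7077          - Spark standalone cluster
            //   mesos://mesos-host:5050           - Mesos cluster
            //   k8s://https://k8s-apiserver:6443  - Kubernetes cluster
            SparkSession spark = SparkSession.builder()
                    .appName("master-url-example")
                    .master("local[*]") // replace with the URL matching your context
                    .getOrCreate();
            spark.stop();
        }
    }
    ```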

 

  • JDK & Spark Home, work directory, ...: the paths to the directories used, on the cluster you are working on
    (which can be a different machine from the Designer you are using)
    metadata Home

 

  • Related metadata: Spark can use external metadata, such as SSH, Kerberos, HDFS, ...
    metadata links

 

  • You can add extra Spark parameters, as described in the Spark documentation: Spark Configuration


metadata config
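
As an illustration, such extra parameters correspond to standard Spark configuration properties. Below is a hedged Java sketch of the plain Spark equivalent; the property values are arbitrary examples, not values required by Stambia.

```java
import org.apache.spark.sql.SparkSession;

public class ExtraParametersExample {
    public static void main(String[] args) {
        // Any property documented in the Spark Configuration page can be set this way;
        // the values below are only illustrative.
        SparkSession spark = SparkSession.builder()
                .appName("extra-parameters-example")
                .master("local[*]")
                .config("spark.sql.shuffle.partitions", "8")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();
        spark.stop();
    }
}
```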

 

  • You can reference external Java JARs that will be used with your Spark jobs:


metadata JARs

 

  • If you need to add extra properties, you can create your own under the root or schema nodes:


metadata extraProperties

 

Server properties configuration

Detailed Spark metadata configuration parameters are:

Root node

  • Spark Master URL (mandatory): Master URL used to connect to Spark. Example: spark://<host>:<port>
  • JDK Home (mandatory): Full path to a JDK available on the server that compiles the classes. It will be used as follows: <JDK Home>/bin/javac. Example: /opt/java
  • Spark Home (mandatory): Home of the Spark installation. It will be used to retrieve <Spark Home>/bin/spark-submit as well as <Spark Home>/jars/*.jar. Example: /opt/spark
  • SSH Server (optional): Drag and drop an SSH Server on this field when Spark cannot be accessed locally by the Runtime. Example: Document Root.Cloudera
  • Kerberos Principal (optional): Drag and drop a Kerberos Principal on this field when the Spark installation is kerberized. Example: Document Root.Kerberos Hadoop
  • Proxy User (optional): Spark proxy user used to impersonate the client connection. Please refer to the Spark documentation for more information. Example: <Proxy user>

 

Data schema node

Beware: some parameters are only available for certain Spark Master URLs, as indicated for each parameter below.

  • Work Directory (mandatory, all Master URLs): Full path to the directory in which the Runtime will compile the JAR files. When SSH is used, this must be a valid path on the SSH Server. Example: /home/stambia/stambiaRuntime/sparkWork
  • HDFS Temporary Folder (optional, all Master URLs): Drag and drop an HDFS Folder on this field. It will be used by some templates when they need to store temporary data in HDFS. This setting can be ignored if none of the templates require temporary HDFS data. Example: Cloudera.cities
  • Driver Memory (optional, all Master URLs): Amount of memory to allocate to the Spark driver program. Please refer to the Spark documentation for more information. Example: 2g
  • Driver Java Options (optional, all Master URLs): Additional Java options to pass when launching the driver program. Please refer to the Spark documentation for more information. Example: --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.xml -Dconfig.file=app.conf"
  • Queue (optional, YARN only): Name of the YARN queue to use when submitting the application. Example: <YARN queue name>
  • Executor Memory (optional, all Master URLs): Amount of memory to use per executor process. Please refer to the Spark documentation for more information. Example: 4g
  • Executor Cores (optional, Spark only): The number of cores to use on each executor. Please refer to the Spark documentation for more information. Example: 1
  • Number of Executors (optional, YARN only): Number of executors requested from YARN. Example: 1
  • Total Executor Cores (optional, Mesos and Spark only): The total number of cores that Spark can use on the cluster. Please refer to the Spark documentation for more information. Example: 1

 

JAR File node

  • Upload Jar file to Cluster (optional): When SSH is used, this option allows uploading a local JAR file to the cluster instead of referencing a JAR file already present on the cluster. Checked = Yes, Unchecked = No
  • Path (mandatory): Full path to the JAR file on the Spark server. Example: /home/stambia/stambiaRuntime/lib/jdbc/hsqldb.jar

 

HDFS metadata links

Spark often works on HDFS data, so you can link your File Metadata to HDFS Metadata.

For example, in the mapping below, we use the "personnes" file as source.
This source is a standard Stambia File Metadata, linked to an HDFS Metadata through an HDFS_LOCATION link.
metadata link HDFS
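
To give an idea of what the Spark job does with such a link, here is a minimal Java sketch that reads a file straight from HDFS. This is plain Spark code rather than the code generated by the templates; the namenode host, port, path and CSV options are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HdfsFileExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hdfs-file-example")
                .master("local[*]")
                .getOrCreate();

        // The namenode host, port and file path below are placeholders.
        Dataset<Row> personnes = spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .csv("hdfs://namenode:8020/user/stambia/personnes.csv");

        personnes.show(10);
        spark.stop();
    }
}
```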

Mapping examples

Spark can be used in many ways inside Stambia Designer.

Each template is available in a dedicated folder:

  • Spark 2
    spark2 templates
  • Spark 1.6
    spark16 templates

Loading and integration

Text files

In this example, we load 2 text files and merge them into one.
spark2 File

  • Sources are 2 text files, joined by "id_rep"
  • Target is another text file, containing information from both source files
  • Loading steps use the Spark template: Action Process LOAD Hdfs File to Spark
  • The integration step uses the Spark template: Action Process INTEGRATION Spark to Hdfs File
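
The actual code is produced by the templates, but conceptually the job resembles the following Java sketch. File paths, CSV options and the output location are assumptions; only the "id_rep" join key comes from the mapping.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FileJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("file-join-sketch")
                .master("local[*]")
                .getOrCreate();

        // LOAD Hdfs File to Spark: read the two source files (paths are placeholders).
        Dataset<Row> left = spark.read().option("header", "true").csv("hdfs://namenode:8020/data/source1.csv");
        Dataset<Row> right = spark.read().option("header", "true").csv("hdfs://namenode:8020/data/source2.csv");

        // Join the two sources on "id_rep", as in the mapping.
        Dataset<Row> merged = left.join(right, "id_rep");

        // INTEGRATION Spark to Hdfs File: write the merged result to the target file.
        merged.write().option("header", "true").csv("hdfs://namenode:8020/data/target");

        spark.stop();
    }
}
```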

 

Hive table

Hive table as source

In this example, we load a Hive table into an SQL table.
spark2 Hive source

  • Sources contain a Hive table: dim_bil_type
  • Target is an SQL table
  • Loading steps use the Spark template: Action Process LOAD Hive to Spark
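
Conceptually, the generated job is close to the following Java sketch. The JDBC URL, credentials and target table name are placeholders; only the Hive table dim_bil_type comes from the example.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveToRdbmsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-to-rdbms-sketch")
                .master("local[*]")
                .enableHiveSupport() // required to read Hive tables
                .getOrCreate();

        // LOAD Hive to Spark: read the Hive source table used in the mapping.
        Dataset<Row> dimBilType = spark.read().table("dim_bil_type");

        // Write to the SQL target; the JDBC URL, credentials and table name are placeholders.
        Properties props = new Properties();
        props.setProperty("user", "app_user");
        props.setProperty("password", "app_password");
        dimBilType.write().mode("append").jdbc("jdbc:postgresql://dbhost:5432/dwh", "target_table", props);

        spark.stop();
    }
}
```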

 

Hive table as target
1st way: integrate without load

In this example, we load 2 SQL tables and merge them into a Hive table.
spark2 Hive target

  • Sources are 2 SQL tables, joined by "TIT_CODE"
  • Target is a Hive table
  • Loading steps use the Spark template: Action Process LOAD Rdbms to Spark
  • The integration step uses the Spark template: Action Process INTEGRATION Spark to Hive
    Note: you can also integrate into a more classical SQL database (such as Oracle or MS-SQL) with the template: Action Process INTEGRATION Spark to Rdbms
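
Again as a rough sketch rather than the actual generated code, the flow corresponds to something like the Java below. The JDBC URL, credentials, source table names and the Hive target name are assumptions; only the "TIT_CODE" join key comes from the mapping.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RdbmsToHiveSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdbms-to-hive-sketch")
                .master("local[*]")
                .enableHiveSupport()
                .getOrCreate();

        // LOAD Rdbms to Spark: read the two SQL source tables (URL, credentials and names are placeholders).
        Properties props = new Properties();
        props.setProperty("user", "app_user");
        props.setProperty("password", "app_password");
        String url = "jdbc:oracle:thin:@dbhost:1521/ORCL";
        Dataset<Row> titles = spark.read().jdbc(url, "T_TITLE", props);
        Dataset<Row> customers = spark.read().jdbc(url, "T_CUSTOMER", props);

        // Join the two sources on "TIT_CODE", as in the mapping.
        Dataset<Row> merged = customers.join(titles, "TIT_CODE");

        // INTEGRATION Spark to Hive: write the result to the Hive target table (name is a placeholder).
        merged.write().mode("overwrite").saveAsTable("dwh.dim_customer");

        spark.stop();
    }
}
```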

 

2nd way: integrate with load

You can do the same mapping with an extra Spark to Hive loading: Action Process LOAD Spark to Hive
spark2 Hive load

Note: you can also load from Spark to Vertica with "Action Process LOAD Spark to Vertica"

 

HBase table

In this example, we load an HBase table into an SQL table.
spark2 HBase

  • Sources contain an HBase table: t_customer
  • Target is an SQL table
  • Loading steps use 2 Spark templates:
    • Action Process LOAD HBase to Spark
    • Action Process LOAD Spark to Rdbms

Note: you can also use a Vertica table as source with "Action Process LOAD HBase to Vertica"

 

Staging

With Stambia, you can use the Spark staging step in two ways: SQL or Java.
Depending on the way you choose, you will fill the stage mapping fields in that language, and the Spark jobs will run with it.

 

SQL

SQL stage mappings are common in Stambia, for instance with all RDBMS technologies.
So, with Spark SQL, there is no difference from RDBMS mappings.

spark2 stage SQL

  • Sources contain 2 SQL tables
  • Stage is "Action Process STAGING Spark as SQL"
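
To illustrate what an SQL stage amounts to, here is a hedged Java sketch where the stage expression is plain SQL evaluated by Spark SQL. The view names, file paths and the query itself are invented for the example, not taken from the mapping.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlStageSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-stage-sketch")
                .master("local[*]")
                .getOrCreate();

        // The SQL sources would have been loaded into Spark by the loading templates;
        // here they are simply registered as temporary views (paths are placeholders).
        spark.read().option("header", "true").csv("hdfs://namenode:8020/data/customers.csv").createOrReplaceTempView("customers");
        spark.read().option("header", "true").csv("hdfs://namenode:8020/data/titles.csv").createOrReplaceTempView("titles");

        // A stage expression in SQL mode is ordinary SQL, evaluated by Spark SQL.
        Dataset<Row> staged = spark.sql(
                "SELECT c.cus_id, UPPER(c.cus_name) AS cus_name, t.tit_label " +
                "FROM customers c JOIN titles t ON c.tit_code = t.tit_code");

        staged.show(10);
        spark.stop();
    }
}
```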

 

Java

Map

The map Spark Java mode means that for every line in the source table, there will be exactly one line at the staging step (1-to-1 correspondence).
The code inside "Expression Editor" is written in Java.

spark2 stage java map

  • Sources contain 2 SQL tables
  • Stage is "Action Process STAGING Spark as Java"

Note: for one-line Java code in the Expression Editor, you can omit the "return" keyword and the ending semicolon.
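
The exact Expression Editor syntax is Stambia-specific, but the underlying operation is a Spark map. The following minimal plain-Spark Java sketch shows the 1-to-1 behaviour; the sample data and the transformation are arbitrary.

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class MapStageSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("map-stage-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<String> names = spark.createDataset(
                Arrays.asList("dupont", "martin"), Encoders.STRING());

        // map: exactly one output line per input line (1-to-1).
        Dataset<String> upper = names.map(
                (MapFunction<String, String>) name -> name.toUpperCase(), Encoders.STRING());

        upper.show();
        spark.stop();
    }
}
```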

 

Flatmap

In a flatmap, for every line in the source table, you can have many lines at the staging step (1-to-N correspondence).
The code inside "Expression Editor" is written in Java.

You have to return an object implementing the "Iterator" interface in the mapping step.
You can therefore return your own objects, as long as they implement the "Iterator" interface.

spark2 stage java flatmap
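
As with map, the sketch below is plain Spark Java rather than the Expression Editor syntax; it shows the 1-to-N behaviour and the Iterator return value. The sample data and the split logic are arbitrary.

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class FlatMapStageSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("flatmap-stage-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<String> lines = spark.createDataset(
                Arrays.asList("john;doe", "jane;smith;junior"), Encoders.STRING());

        // flatMap: each input line can produce zero, one or many output lines (1-to-N).
        // The function must return an Iterator; any object implementing Iterator works.
        Dataset<String> words = lines.flatMap(
                (FlatMapFunction<String, String>) line -> Arrays.asList(line.split(";")).iterator(),
                Encoders.STRING());

        words.show();
        spark.stop();
    }
}
```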

 

Common mapping parameters

Every template has its own parameters, so you can refer to the Stambia Designer internal documentation for details about each of them.
However, here are the most common parameters:

  • cleanTemporaryObjects: If true, the temporary objects created during integration are removed at the end of the process.
  • coalesceCount: If not empty, specifies the number of partitions resulting of a coalesce() operation on the Dataset.
  • compile: When this option is set to true, the application is compiled and a JAR File is created.
  • debug: When this option is set to true, additional information about the Datasets is written to the standard output.
  • executionUnit: When multiple templates share the same Execution Unit name, only one of them will submit the JAR file embedding the Java programs for all other templates.
  • persistStorageLevel: Allows selecting the persistence storage level of the main Dataset.
  • repartitionCount: If not empty, specifies the number of partitions resulting of a repartition() operation on the main Dataset.
  • repartitionMethod: Specifies how the data is to be distributed.
  • useDistinct: If true, duplicate records will be filtered out.
  • submit: When this option is set to true, the JAR file for this Execution Unit is executed using a spark-submit command.
  • workFolder: Specifies the location where the Java files are generated locally before being sent to the compile directory specified in the Spark Metadata.
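
Several of these parameters map directly onto standard Dataset operations. The following hedged Java sketch shows the operations they correspond to; the Dataset and the values are placeholders, not the templates' actual code.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CommonParametersSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("common-parameters-sketch")
                .master("local[*]")
                .getOrCreate();

        // Placeholder Dataset standing in for the main Dataset handled by a template.
        Dataset<Row> ds = spark.range(1000).toDF("id");

        ds = ds.repartition(8);                          // repartitionCount
        ds = ds.coalesce(4);                             // coalesceCount
        ds = ds.distinct();                              // useDistinct
        ds = ds.persist(StorageLevel.MEMORY_AND_DISK()); // persistStorageLevel

        System.out.println(ds.count());
        spark.stop();
    }
}
```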