Spark, the cluster-computing framework from the Hadoop ecosystem, is fully supported in Stambia, as explained in the Stambia DI for Spark article.

You'll find below everything needed to begin using Spark in Stambia, from the Metadata creation to the first Mappings.

Prerequisites:

You must install the Hadoop connector to be able to work with Spark.

Please refer to the Getting started with Hadoop Connector article, which will guide you through this.

A note about Spark versions:
This article covers Spark 2 and later versions.
For Spark 1.6, dedicated Stambia templates are also available.
Even though Spark 1.6 and 2.x are different, the Stambia templates are close enough to be used the same way.

 

The first step when working with Spark in Stambia DI is to create and configure the Spark Metadata.

Here is a summary of the main steps to follow:

  1. Spark connector modes
  2. Spark metadata
    1. Spark metadata creation
    2. Server properties configuration
    3. HDFS metadata links
  3. Mapping examples
    1. Loading and integration
    2. Staging
    3. Common mapping parameters

 

Spark connector modes

In Stambia, there are two ways to use the Spark connector.
Both use the "spark-submit" command to run jobs, but they differ in where the Java code is compiled:

  • Local mode:
    The Java code generated by Stambia is compiled on the same server the Stambia Runtime is running on.
    Once compilation is done, the generated JARs and dependencies are sent to the Spark cluster and executed with "spark-submit".
    local connector

 

  • SSH mode:
    The Java code generated by Stambia and its dependencies are sent to the Spark cluster over SSH.
    The Java code is then compiled on the Spark cluster, also over SSH.
    Finally, "spark-submit" is executed to run the jobs.
    ssh connector

 

Spark metadata

Spark metadata can be used as a stage (between loading and integration) to boost Hadoop mappings.

Spark metadata creation

Here is an example of a common Metadata Configuration:


metadata

The main elements of this Spark Metadata are:

  • The Spark Master URL: the connection to the Spark cluster, depending on your context.
    It can take several forms: Kubernetes, local, YARN, Spark standalone, Mesos, ...
    metadata Spark URL
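
    For reference, here is a minimal Java sketch (plain Spark code, not Stambia-generated code) showing the usual Master URL formats; the host names and ports are placeholders.

    ```java
    import org.apache.spark.sql.SparkSession;

    public class MasterUrlExamples {
        public static void main(String[] args) {
            // Typical Master URL formats (hosts and ports below are placeholders):
            //   local[*]                          - run locally, using all available cores
            //   yarn                              - submit to a YARN cluster
            //   spark://master-host:7077          - Spark standalone cluster
            //   mesos://mesos-host:5050           - Mesos cluster
            //   k8s://https://k8s-apiserver:6443  - Kubernetes cluster
            SparkSession spark = SparkSession.builder()
                    .appName("master-url-example")
                    .master("local[*]") // replace with the URL matching your context
                    .getOrCreate();
            spark.stop();
        }
    }
    ```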

 

  • JDK & Spark Home, work directory, ...: the paths to the directories used, on the cluster you are working on
    (which can be a different machine from the Designer you are using)
    metadata Home

 

  • Related metadata: Spark can use external metadata, such as SSH, Kerberos, HDFS, ...
    metadata links

 

  • You can add extra Spark parameters, as described in the Spark documentation: Spark Configuration


metadata config
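
As an illustration, such extra parameters correspond to standard Spark configuration properties. Below is a hedged Java sketch of the plain Spark equivalent; the property values are arbitrary examples, not values required by Stambia.

```java
import org.apache.spark.sql.SparkSession;

public class ExtraParametersExample {
    public static void main(String[] args) {
        // Any property documented in the Spark Configuration page can be set this way;
        // the values below are only illustrative.
        SparkSession spark = SparkSession.builder()
                .appName("extra-parameters-example")
                .master("local[*]")
                .config("spark.sql.shuffle.partitions", "8")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();
        spark.stop();
    }
}
```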

 

  • You can reference external Java JARs that will be used with your Spark jobs:


metadata JARs

 

  • If you need to add extra properties, you can create your own under the root or schema nodes:


metadata extraProperties

 

Server properties configuration

Detailed Spark metadata configuration parameters are:

Root node

  • Spark Master URL (mandatory): Master URL used to connect to Spark. Example: spark://<host>:<port>
  • JDK Home (mandatory): Full path to a JDK available on the server that compiles the classes. It will be used as follows: <JDK Home>/bin/javac. Example: /opt/java
  • Spark Home (mandatory): Home of the Spark installation. It will be used to retrieve <Spark Home>/bin/spark-submit as well as <Spark Home>/jars/*.jar. Example: /opt/spark
  • SSH Server (optional): Drag and drop an SSH Server on this field when Spark cannot be accessed locally by the Runtime. Example: Document Root.Cloudera
  • Kerberos Principal (optional): Drag and drop a Kerberos Principal on this field when the Spark installation is kerberized. Example: Document Root.Kerberos Hadoop
  • Proxy User (optional): Spark proxy user used to impersonate the client connection. Please refer to the Spark documentation for more information. Example: <Proxy user>

 

Data schema node

Beware: some parameters are only available for certain Spark Master URLs, as indicated for each parameter below.

  • Work Directory (mandatory, all Master URLs): Full path to the directory in which the Runtime will compile the JAR files. When SSH is used, this must be a valid path on the SSH Server. Example: /home/stambia/stambiaRuntime/sparkWork
  • HDFS Temporary Folder (optional, all Master URLs): Drag and drop an HDFS Folder on this field. It will be used by some templates when they need to store temporary data in HDFS. This setting can be ignored if none of the templates require temporary HDFS data. Example: Cloudera.cities
  • Driver Memory (optional, all Master URLs): Amount of memory to allocate to the Spark driver program. Please refer to the Spark documentation for more information. Example: 2g
  • Driver Java Options (optional, all Master URLs): Additional Java options to pass when launching the driver program. Please refer to the Spark documentation for more information. Example: --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.xml -Dconfig.file=app.conf"
  • Queue (optional, YARN only): Name of the YARN queue to use when submitting the application. Example: <YARN queue name>
  • Executor Memory (optional, all Master URLs): Amount of memory to use per executor process. Please refer to the Spark documentation for more information. Example: 4g
  • Executor Cores (optional, Spark only): The number of cores to use on each executor. Please refer to the Spark documentation for more information. Example: 1
  • Number of Executors (optional, YARN only): Number of executors requested from YARN. Example: 1
  • Total Executor Cores (optional, Mesos and Spark only): The total number of cores that Spark can use on the cluster. Please refer to the Spark documentation for more information. Example: 1

 

JAR File node

  • Upload Jar file to Cluster (optional): When SSH is used, this option allows uploading a local JAR file to the cluster instead of referencing a JAR file already present on the cluster. Checked = Yes, Unchecked = No
  • Path (mandatory): Full path to the JAR file on the Spark server. Example: /home/stambia/stambiaRuntime/lib/jdbc/hsqldb.jar

 

HDFS metadata links

Spark often works on HDFS data, so you can link your File Metadata to HDFS Metadata.

For example, in the mapping below, we use the "personnes" file as source.
This source is a standard Stambia File Metadata, linked to an HDFS Metadata through an HDFS_LOCATION link.
metadata link HDFS
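
To give an idea of what the Spark job does with such a link, here is a minimal Java sketch that reads a file straight from HDFS. This is plain Spark code rather than the code generated by the templates; the namenode host, port, path and CSV options are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HdfsFileExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hdfs-file-example")
                .master("local[*]")
                .getOrCreate();

        // The namenode host, port and file path below are placeholders.
        Dataset<Row> personnes = spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .csv("hdfs://namenode:8020/user/stambia/personnes.csv");

        personnes.show(10);
        spark.stop();
    }
}
```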

Mapping examples

Spark can be used in many ways inside Stambia Designer.

Each template is available in a dedicated folder:

  • Spark 2
    spark2 templates
  • Spark 1.6
    spark16 templates

Loading and integration

Text files

In this example, we load 2 text files and merge them into one.
spark2 File

  • Sources are 2 text files, joined by "id_rep"
  • Target is another text file, containing information from both source files
  • Loading steps use the Spark template: Action Process LOAD Hdfs File to Spark
  • The integration step uses the Spark template: Action Process INTEGRATION Spark to Hdfs File
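
The actual code is produced by the templates, but conceptually the job resembles the following Java sketch. File paths, CSV options and the output location are assumptions; only the "id_rep" join key comes from the mapping.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FileJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("file-join-sketch")
                .master("local[*]")
                .getOrCreate();

        // LOAD Hdfs File to Spark: read the two source files (paths are placeholders).
        Dataset<Row> left = spark.read().option("header", "true").csv("hdfs://namenode:8020/data/source1.csv");
        Dataset<Row> right = spark.read().option("header", "true").csv("hdfs://namenode:8020/data/source2.csv");

        // Join the two sources on "id_rep", as in the mapping.
        Dataset<Row> merged = left.join(right, "id_rep");

        // INTEGRATION Spark to Hdfs File: write the merged result to the target file.
        merged.write().option("header", "true").csv("hdfs://namenode:8020/data/target");

        spark.stop();
    }
}
```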

 

Hive table

Hive table as source

In this example, we load a Hive table into an SQL table.
spark2 Hive source

  • Sources contain a Hive table: dim_bil_type
  • Target is an SQL table
  • Loading steps use the Spark template: Action Process LOAD Hive to Spark
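
Conceptually, the generated job is close to the following Java sketch. The JDBC URL, credentials and target table name are placeholders; only the Hive table dim_bil_type comes from the example.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveToRdbmsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-to-rdbms-sketch")
                .master("local[*]")
                .enableHiveSupport() // required to read Hive tables
                .getOrCreate();

        // LOAD Hive to Spark: read the Hive source table used in the mapping.
        Dataset<Row> dimBilType = spark.read().table("dim_bil_type");

        // Write to the SQL target; the JDBC URL, credentials and table name are placeholders.
        Properties props = new Properties();
        props.setProperty("user", "app_user");
        props.setProperty("password", "app_password");
        dimBilType.write().mode("append").jdbc("jdbc:postgresql://dbhost:5432/dwh", "target_table", props);

        spark.stop();
    }
}
```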

 

Hive table as target
1st way: integrate without load

In this example, we load 2 SQL tables and merge them into a Hive table.
spark2 Hive target

  • Sources are 2 SQL tables, joined by "TIT_CODE"
  • Target is a Hive table
  • Loading steps use the Spark template: Action Process LOAD Rdbms to Spark
  • The integration step uses the Spark template: Action Process INTEGRATION Spark to Hive
    Note: you can also integrate into a more classical SQL database (such as Oracle or MS-SQL) with the template: Action Process INTEGRATION Spark to Rdbms
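
Again as a rough sketch rather than the actual generated code, the flow corresponds to something like the Java below. The JDBC URL, credentials, source table names and the Hive target name are assumptions; only the "TIT_CODE" join key comes from the mapping.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RdbmsToHiveSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdbms-to-hive-sketch")
                .master("local[*]")
                .enableHiveSupport()
                .getOrCreate();

        // LOAD Rdbms to Spark: read the two SQL source tables (URL, credentials and names are placeholders).
        Properties props = new Properties();
        props.setProperty("user", "app_user");
        props.setProperty("password", "app_password");
        String url = "jdbc:oracle:thin:@dbhost:1521/ORCL";
        Dataset<Row> titles = spark.read().jdbc(url, "T_TITLE", props);
        Dataset<Row> customers = spark.read().jdbc(url, "T_CUSTOMER", props);

        // Join the two sources on "TIT_CODE", as in the mapping.
        Dataset<Row> merged = customers.join(titles, "TIT_CODE");

        // INTEGRATION Spark to Hive: write the result to the Hive target table (name is a placeholder).
        merged.write().mode("overwrite").saveAsTable("dwh.dim_customer");

        spark.stop();
    }
}
```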

 

2nd way: integrate with load

You can do the same mapping with an extra Spark to Hive loading: Action Process LOAD Spark to Hive
spark2 Hive load

Note: you can also load from Spark to Vertica with "Action Process LOAD Spark to Vertica"

 

HBase table

In this example, we load an HBase table into an SQL table.
spark2 HBase

  • Sources contain an HBase table: t_customer
  • Target is an SQL table
  • Loading steps use 2 Spark templates:
    • Action Process LOAD HBase to Spark
    • Action Process LOAD Spark to Rdbms

Note: you can also use a Vertica table as source with "Action Process LOAD HBase to Vertica"

 

Staging

With Stambia, you can use the Spark staging step in two ways: SQL or Java.
Depending on the way you choose, you will fill the stage mapping fields in that language, and the Spark jobs will run with it.

 

SQL

SQL stage mappings are common in Stambia, for instance with all RDBMS technologies.
So, with Spark SQL, there is no difference from RDBMS mappings.

spark2 stage SQL

  • Sources contain 2 SQL tables
  • Stage is "Action Process STAGING Spark as SQL"
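
To illustrate what an SQL stage amounts to, here is a hedged Java sketch where the stage expression is plain SQL evaluated by Spark SQL. The view names, file paths and the query itself are invented for the example, not taken from the mapping.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlStageSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-stage-sketch")
                .master("local[*]")
                .getOrCreate();

        // The SQL sources would have been loaded into Spark by the loading templates;
        // here they are simply registered as temporary views (paths are placeholders).
        spark.read().option("header", "true").csv("hdfs://namenode:8020/data/customers.csv").createOrReplaceTempView("customers");
        spark.read().option("header", "true").csv("hdfs://namenode:8020/data/titles.csv").createOrReplaceTempView("titles");

        // A stage expression in SQL mode is ordinary SQL, evaluated by Spark SQL.
        Dataset<Row> staged = spark.sql(
                "SELECT c.cus_id, UPPER(c.cus_name) AS cus_name, t.tit_label " +
                "FROM customers c JOIN titles t ON c.tit_code = t.tit_code");

        staged.show(10);
        spark.stop();
    }
}
```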

 

Java

Map

The map Spark Java mode means that for every line in the source table, there will be exactly one line at the staging step (1-to-1 correspondence).
The code inside "Expression Editor" is written in Java.

spark2 stage java map

  • Sources contain 2 SQL tables
  • Stage is "Action Process STAGING Spark as Java"

Note: for one-line Java code in the Expression Editor, you can omit the "return" keyword and the ending semicolon.
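
The exact Expression Editor syntax is Stambia-specific, but the underlying operation is a Spark map. The following minimal plain-Spark Java sketch shows the 1-to-1 behaviour; the sample data and the transformation are arbitrary.

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class MapStageSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("map-stage-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<String> names = spark.createDataset(
                Arrays.asList("dupont", "martin"), Encoders.STRING());

        // map: exactly one output line per input line (1-to-1).
        Dataset<String> upper = names.map(
                (MapFunction<String, String>) name -> name.toUpperCase(), Encoders.STRING());

        upper.show();
        spark.stop();
    }
}
```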

 

Flatmap

In a flatmap, for every line in the source table, you can have many lines at the staging step (1-to-N correspondence).
The code inside "Expression Editor" is written in Java.

You have to return an object implementing the "Iterator" interface in the mapping step.
You can therefore return your own objects, as long as they implement the "Iterator" interface.

spark2 stage java flatmap
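
As with map, the sketch below is plain Spark Java rather than the Expression Editor syntax; it shows the 1-to-N behaviour and the Iterator return value. The sample data and the split logic are arbitrary.

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class FlatMapStageSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("flatmap-stage-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<String> lines = spark.createDataset(
                Arrays.asList("john;doe", "jane;smith;junior"), Encoders.STRING());

        // flatMap: each input line can produce zero, one or many output lines (1-to-N).
        // The function must return an Iterator; any object implementing Iterator works.
        Dataset<String> words = lines.flatMap(
                (FlatMapFunction<String, String>) line -> Arrays.asList(line.split(";")).iterator(),
                Encoders.STRING());

        words.show();
        spark.stop();
    }
}
```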

 

Common mapping parameters

Every template has its own parameters, so you can refer to the Stambia Designer internal documentation for details about each of them.
However, here are the most common parameters:

  • cleanTemporaryObjects: If true, the temporary objects created during integration are removed at the end of the process.
  • coalesceCount: If not empty, specifies the number of partitions resulting of a coalesce() operation on the Dataset.
  • compile: When this option is set to true, the application is compiled and a JAR File is created.
  • debug: When this option is set to true, additional information about the Datasets is written to the standard output.
  • executionUnit: When multiple templates share the same Execution Unit name, only one of them will submit the JAR file embedding the Java programs for all other templates.
  • persistStorageLevel: Allows selecting the persistence storage level of the main Dataset.
  • repartitionCount: If not empty, specifies the number of partitions resulting of a repartition() operation on the main Dataset.
  • repartitionMethod: Specifies how the data is to be distributed.
  • useDistinct: If true, duplicate records will be filtered out.
  • submit: When this option is set to true, the JAR file for this Execution Unit is executed using a spark-submit command.
  • workFolder: Specifies the location where the Java files are generated locally before being sent to the compile directory specified in the Spark Metadata.
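
Several of these parameters map directly onto standard Dataset operations. The following hedged Java sketch shows the operations they correspond to; the Dataset and the values are placeholders, not the templates' actual code.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CommonParametersSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("common-parameters-sketch")
                .master("local[*]")
                .getOrCreate();

        // Placeholder Dataset standing in for the main Dataset handled by a template.
        Dataset<Row> ds = spark.range(1000).toDF("id");

        ds = ds.repartition(8);                          // repartitionCount
        ds = ds.coalesce(4);                             // coalesceCount
        ds = ds.distinct();                              // useDistinct
        ds = ds.persist(StorageLevel.MEMORY_AND_DISK()); // persistStorageLevel

        System.out.println(ds.count());
        spark.stop();
    }
}
```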