Spark, the cluster-computing framework from the Hadoop ecosystem, is fully supported in Stambia, as explained in the Stambia DI for Spark article.
You'll find below everything needed to begin using Spark in Stambia, from the Metadata creation to the first Mappings.
Prerequisites:
You must install the Hadoop connector to be able to work with Spark.
Please refer to the Getting started with Hadoop Connector article, which will guide you through this installation.
A note on Spark versions:
This article covers Spark 2 and later versions.
For Spark 1.6, dedicated Stambia templates are also available.
Although Spark 1.6 and 2.x differ, the Stambia templates are close enough to be used the same way.
The first step when working with Spark in Stambia DI is to create and configure the Spark Metadata.
Here is a summary of the main steps to follow:
- Spark connector modes
- Spark metadata
- Spark metadata creation
- Server properties configuration
- HDFS metadata links
- Mapping examples
- Loading and integration
- Staging
- Common mapping parameters
Spark connector modes
In Stambia, there are two ways to use the Spark connector.
Both use the "spark-submit" command to run jobs, but they differ in where the Java code is compiled:
- Local mode:
The Java code generated by Stambia is compiled on the same server the Stambia Runtime is running on.
Once compilation is done, the generated JARs and their dependencies are sent to the Spark cluster and executed with "spark-submit".
- SSH mode:
The Java code generated by Stambia and its dependencies are sent to the Spark cluster over SSH.
The Java code is then compiled on the Spark cluster, also over SSH.
Finally, "spark-submit" is executed to run the jobs.
Spark metadata
Spark metadata can be used as a stage (between loading and integration) to boost Hadoop mappings.
Spark metadata creation
Here is an example of a common Metadata Configuration:
The main elements of this Spark Metadata are (a short Java sketch of the equivalent Spark settings follows this list):
- The Spark Master URL: the connection to the Spark cluster, depending on your context.
It can be set in many ways: kubernetes, local, yarn, spark, mesos, ...
- JDK Home, Spark Home, work directory, ...: the paths to the directories used, on the cluster you are working on
(which can be a different machine from the Designer you are using)
- Related metadata: Spark can use external metadata, such as SSH, Kerberos, HDFS, ...
- You can add extra Spark parameters, as described in the Spark documentation: Spark Configuration
- You can reference external Java JARs that will be used with your Spark jobs:
- If you need to add extra properties, you can create your own under the root or schema nodes:
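For orientation, the sketch below shows how these Metadata fields map onto standard Spark settings if a session were configured directly through Spark's own Java API. It is not the code Stambia generates; the application name, master URL, property value and JAR path are placeholders.

import org.apache.spark.sql.SparkSession;

// Illustration only: the Metadata fields expressed as standard Spark settings.
// The application name, master URL, property value and JAR path are placeholders.
public class SparkMetadataSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("stambia-spark-demo")                   // hypothetical application name
                .master("yarn")                                  // Spark Master URL field
                .config("spark.sql.shuffle.partitions", "200")   // an extra Spark parameter
                .config("spark.jars", "/opt/libs/hsqldb.jar")    // an external JAR reference
                .getOrCreate();
        spark.stop();
    }
}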
Server properties configuration
Detailed Spark metadata configuration parameters are:
Root node
Property | Mandatory | Description | Example |
Spark Master URL | Yes | Master URL used to connect to Spark | spark://<host>:<port> |
JDK Home | Yes | Full path to a JDK that will be available on the server that compiles the classes. This will be used as follows: <JDK Home>/bin/javac | /opt/java |
Spark Home | Yes | Home of the spark installation. This will be used to retrieve <Spark Home>/bin/spark-submit as well as <Spark Home>/jars/*.jar | /opt/spark |
SSH Server | No | Drag and drop a SSH Server on this field when Spark cannot be accessed locally by the Runtime | Document Root.Cloudera |
Kerberos Principal | No | Drag and drop a Kerberos Principal on this field when Spark installation is kerberized | Document Root.Kerberos Hadoop |
Proxy User | No | Spark proxy-user to use to impersonate the client connection. Please refer to the Spark documentation for more information. | <Proxy user> |
Data schema node
Beware: some parameters are only available for certain Spark Master URLs.
Property | Mandatory | Description | Example | Spark Master URL |
Work Directory | Yes | Full path to the directory in which the Runtime will compile the Jar files. When SSH is used, this must be a valid path on the SSH Server. | /home/stambia/stambiaRuntime/sparkWork | all |
HDFS Temporary Folder | No | Drag and drop a HDFS Folder on this field. This will be used by some templates when needed to store temporary data in HDFS. This setting can be ignored if none of the templates require temporary HDFS data | Cloudera.cities | all |
Driver Memory | No | Amount of memory to allocate to the Spark Driver Program. Example: 2g. Please refer to the Spark documentation for more information. | 2g | all |
Driver Java Options | No | Specify any additional Java Options to pass when launching the driver program. Please refer to the Spark documentation for more information. | --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.xml -Dconfig.file=app.conf" | all |
Queue | No | Name of the Yarn queue to use when submitting the application. | <YARN queue name> | YARN only |
Executor Memory | No | Amount of memory to use per executor process. Example: 4g. Please refer to the Spark documentation for more information. | 4g | all |
Executor Cores | No | The number of cores to use on each executor. Please refer to the Spark documentation for more information. | 1 | Spark Only |
Number of Executors | No | Number of executors requested to Yarn. | 1 | YARN only |
Total Executor Cores | No | The total number of cores that Spark can use on the cluster. Please refer to the Spark documentation for more information. | 1 | Mesos and Spark only |
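The resource-related fields of this table correspond to standard Spark configuration properties. The sketch below simply lists those property names (taken from the Spark documentation) next to the Metadata fields; the values are placeholders, and with Stambia they are actually passed through spark-submit rather than set in code.

import org.apache.spark.SparkConf;

// Orientation only: the standard Spark properties behind the data schema fields.
// The values are placeholders.
public class DataSchemaPropertiesSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .set("spark.driver.memory", "2g")       // Driver Memory
                .set("spark.executor.memory", "4g")     // Executor Memory
                .set("spark.executor.cores", "1")       // Executor Cores
                .set("spark.executor.instances", "1")   // Number of Executors (YARN)
                .set("spark.cores.max", "1")            // Total Executor Cores (Mesos and standalone)
                .set("spark.yarn.queue", "default");    // Queue (YARN)
        System.out.println(conf.toDebugString());
    }
}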
JAR File node
Property | Mandatory | Description | Example |
Upload Jar file to Cluster | No | When SSH is used, this option allows uploading a local JAR File to the Cluster instead of referencing an already existing JAR File on the Cluster | Checked = Yes Unchecked = No
Path | Yes | Full Path to the Jar file on the Spark server | /home/stambia/stambiaRuntime/lib/jdbc/hsqldb.jar |
HDFS metadata links
Spark often works on HDFS data, so you can link file Metadata to HDFS Metadata.
For example, in the Mapping below, the "personnes" file is used as source.
This source is a standard Stambia file Metadata, but linked to an HDFS Metadata with an HDFS_LOCATION link.
Mapping examples
Spark can be used in many ways inside Stambia Designer.
Each template is available in a dedicated folder:
- Spark 2
- Spark 1.6
Loading and integration
Text files
In this example, we load 2 text files and merge them into one.
- Sources are 2 text files, joined on "id_rep"
- Target is another text file, containing information from both source files
- Loading steps use the Spark template: Action Process LOAD Hdfs File to Spark
- The Integration step uses the Spark template: Action Process INTEGRATION Spark to Hdfs File
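To make the flow concrete, here is a rough plain-Spark sketch of what such a Mapping comes down to: read two delimited files from HDFS, join them on "id_rep" and write the merged result back. The paths, delimiter and header option are assumptions; the actual code is produced by the templates.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch only: the equivalent of this Mapping in plain Spark Java.
// Paths, delimiter and header option are assumptions.
public class MergeTextFilesSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("merge-text-files").getOrCreate();

        Dataset<Row> left = spark.read()
                .option("header", "true").option("delimiter", ";")
                .csv("hdfs:///demo/source_a");       // first source file
        Dataset<Row> right = spark.read()
                .option("header", "true").option("delimiter", ";")
                .csv("hdfs:///demo/source_b");       // second source file

        // Join both sources on "id_rep" and write the merged result to the target file.
        left.join(right, "id_rep")
            .coalesce(1)
            .write().option("header", "true").option("delimiter", ";")
            .csv("hdfs:///demo/target");

        spark.stop();
    }
}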
Hive table
Hive table as source
In this example, we load a Hive table into an SQL table.
- Sources contain a Hive table: dim_bil_type
- Target is an SQL table
- Loading steps use the Spark template: Action Process LOAD Hive to Spark
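In plain Spark terms, this corresponds roughly to reading the Hive table and writing it to the SQL table over JDBC, as in the sketch below. The JDBC URL, credentials and target table name are placeholders; the actual code is generated by the templates.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch only: Hive source to SQL target over JDBC.
// The JDBC URL, credentials and target table name are placeholders.
public class HiveToRdbmsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-to-rdbms")
                .enableHiveSupport()                        // required to read Hive tables
                .getOrCreate();

        Dataset<Row> source = spark.table("dim_bil_type");  // Hive source table

        Properties props = new Properties();
        props.setProperty("user", "demo");
        props.setProperty("password", "demo");
        source.write().mode("append")
              .jdbc("jdbc:hsqldb:hsql://dbhost/db", "DIM_BIL_TYPE", props);  // SQL target table

        spark.stop();
    }
}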
Hive table as target
1st way: integrate without load
In this example, we load 2 SQL tables and merge them into a Hive table.
- Sources are 2 SQL tables, joined on "TIT_CODE"
- Target is a Hive table
- Loading steps use the Spark template: Action Process LOAD Rdbms to Spark
- The Integration step uses the Spark template: Action Process INTEGRATION Spark to Hive
Note: you can also integrate into a more classical SQL database (such as Oracle or MS-SQL) with the template: Action Process INTEGRATION Spark to Rdbms
2nd way: integrate with load
You can do the same Mapping with an extra Spark to Hive loading step: Action Process LOAD Spark to Hive
Note: you can also load from Spark to Vertica with "Action Process LOAD Spark to Vertica"
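For orientation, this direction looks roughly like the sketch below in plain Spark: load the two SQL sources over JDBC, join them on "TIT_CODE" and append the result into an existing Hive table. The JDBC URL, credentials and table names are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch only: two SQL sources joined and appended into an existing Hive table.
// The JDBC URL, credentials and table names are placeholders.
public class RdbmsToHiveSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdbms-to-hive")
                .enableHiveSupport()
                .getOrCreate();

        String url = "jdbc:hsqldb:hsql://dbhost/db";
        java.util.Properties props = new java.util.Properties();
        props.setProperty("user", "demo");
        props.setProperty("password", "demo");

        Dataset<Row> customers = spark.read().jdbc(url, "T_CUSTOMER", props);  // first SQL source
        Dataset<Row> titles    = spark.read().jdbc(url, "T_TITLE", props);     // second SQL source

        customers.join(titles, "TIT_CODE")         // join on TIT_CODE
                 .write().mode("append")
                 .insertInto("dim_customer");      // existing Hive target table

        spark.stop();
    }
}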
HBase table
In this example, we load an HBase table into an SQL table.
- Sources contain an HBase table: t_customer
- Target is an SQL table
- Loading steps use 2 Spark templates:
- Action Process LOAD HBase to Spark
- Action Process LOAD Spark to Rdbms
Note: you can also use a Vertica table as source with "Action Process LOAD HBase to Vertica"
Staging
With Stambia, you can use the Spark staging step in two ways: SQL or Java.
Depending on the mode you choose, you write the stage mapping fields in that language, and the Spark jobs will execute it.
SQL
SQL stage mappings are common in Stambia, for instance with all RDBMS technologies.
So, with Spark SQL, there is no difference from RDBMS mappings.
- Sources contain 2 SQL tables
- Stage is "Action Process STAGING Spark as SQL"
Java
Map
The map Spark Java mode means that for every line in the source table, there is exactly one line in the staging step (1-to-1 correspondence).
The code inside "Expression Editor" is written in Java.
- Sources contain 2 SQL tables
- Stage is "Action Process STAGING Spark as Java"
Note: for one-line Java code in the Expression Editor, you can omit the "return" keyword and the ending semicolon.
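As an illustration of both forms, here are two hypothetical map-mode expressions; the CUS_FIRSTNAME and CUS_LASTNAME column names are assumptions.

// One-line form: the "return" keyword and the ending semicolon can be omitted.
CUS_FIRSTNAME.trim() + " " + CUS_LASTNAME.toUpperCase()

// Multi-line form: write regular Java and return the value explicitly.
String fullName = CUS_FIRSTNAME.trim() + " " + CUS_LASTNAME.trim();
return fullName.toUpperCase();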
Flatmap
In a flatmap, every line in the source table can produce many lines in the staging step (1-to-N correspondence).
The code inside "Expression Editor" is written in Java.
You have to return an object implementing the "Iterator" interface in the mapping step.
So, you can return your own objects, as long as they implement the "Iterator" interface.
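For example, a hypothetical flatmap expression could split a ";"-separated column into one staging line per value; the PHONE_LIST column name is an assumption, and any custom object implementing java.util.Iterator could be returned instead.

// Hypothetical flatmap-mode expression: one source line produces one staging line per value.
return java.util.Arrays.asList(PHONE_LIST.split(";")).iterator();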
Common mapping parameters
Every template has its own parameters, so you can refer to the Stambia Designer internal documentation for details about each of them.
However, here are the most common parameters:
- cleanTemporaryObjects: If true, the temporary objects created during integration are removed at the end of the process.
- coalesceCount: If not empty, specifies the number of partitions resulting from a coalesce() operation on the Dataset.
- compile: When this option is set to true, the application is compiled and a JAR File is created.
- debug: When this option is set to true, additional information about the Datasets is written to the standard output.
- executionUnit: When multiple templates share the same Execution Unit name, only one of them will submit the JAR file embedding the Java programs for all other templates.
- persistStorageLevel: Allows selecting the persistence storage level of the main Dataset.
- repartitionCount: If not empty, specifies the number of partitions resulting from a repartition() operation on the main Dataset.
- repartitionMethod: Specifies how the data is to be distributed.
- useDistinct: If true, duplicate records will be filtered out.
- submit: When this option is set to true, the JAR file for this Execution Unit is executed using a spark-submit command.
- workFolder: Specifies the location where the Java files are to be generated locally before being sent to the compile directory specified in the Spark Metadata.
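Several of these parameters correspond directly to plain Dataset operations, as the sketch below shows for orientation; the input path and the values used are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

// Orientation only: the Dataset operations behind several template parameters.
// The input path and the values are placeholders.
public class CommonParametersSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("common-parameters").getOrCreate();
        Dataset<Row> ds = spark.read().parquet("hdfs:///demo/stage");

        ds = ds.repartition(8);                      // repartitionCount / repartitionMethod
        ds = ds.coalesce(4);                         // coalesceCount
        ds = ds.distinct();                          // useDistinct
        ds.persist(StorageLevel.MEMORY_AND_DISK());  // persistStorageLevel

        System.out.println(ds.count());
        spark.stop();
    }
}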