Getting started with HDFS

Metadata

The first step, when you want to work with HDFS in Stambia DI, consists of creating and configuring the HDFS Metadata.

Here is a summary of the main steps to follow:

Creation of the HDFS Metadata
Configuration of the server properties
(Optional) Configuration of the Kerberos security
Definition of the HDFS folders

And here is an example of a common Metadata configuration:

HDFSMetadataOverview

Metadata creation

First, create the HDFS Metadata as usual, by selecting the HDFS technology in the Metadata Creation Wizard:

HDFSNewMetadata

Click Next, choose a name, and click Finish.

Configuration of the server properties

The connector can use several APIs to perform operations, leaving the choice of method to the user.

This choice affects how Stambia connects to HDFS and performs operations on it.

You only need to specify the properties of the API you plan to use, not those of every API.

The following APIs are available:

API	Description	Kerberos security support
Java	The Java APIs provided by Apache and/or the Hadoop distribution are used to perform the operations. Stambia uses the libraries and utilities installed with the connector, meaning that the Hadoop HDFS libraries must be installed in Stambia to use this API. (See installation article.)	Yes
Web HDFS	The Web HDFS API uses RESTful web services to perform the operations. Stambia invokes the REST calls corresponding to the operations.	Not currently supported
NFS Export	Stambia performs the operations directly through an NFS Gateway. The NFS Gateway must be installed on the system where the Runtime is located. Please refer to the Apache documentation for further information about how to install an NFS Gateway for HDFS on the file system.	Yes
Command Line [Over SSH]	The Hadoop command line tools are used to perform the operations. Stambia executes the commands corresponding to the operations, on the local system or on a remote server through SSH, depending on the chosen option.	Coming soon
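
To give an idea of what "invoking the REST APIs corresponding to the operations" means for the Web HDFS API, here is a minimal sketch that builds the HTTP method and URL of a WebHDFS call. It is an illustration based on the public Apache WebHDFS REST API, not Stambia's own implementation. The helper name webhdfs_request and the path /user/demo/input are hypothetical.

```python
from urllib.parse import quote, urlencode

# HTTP method used by a few WebHDFS operations (Apache WebHDFS REST API).
WEBHDFS_METHODS = {
    "LISTSTATUS": "GET",
    "OPEN": "GET",
    "MKDIRS": "PUT",
    "CREATE": "PUT",
    "RENAME": "PUT",
    "DELETE": "DELETE",
}


def webhdfs_request(base_url, hdfs_path, op, **params):
    """Return (http_method, url) for one WebHDFS operation.

    base_url is expected to end with /webhdfs/v1 and hdfs_path to be absolute.
    This helper is hypothetical and only builds the request; it sends nothing.
    """
    method = WEBHDFS_METHODS[op]
    query = urlencode({"op": op, **params})
    url = base_url.rstrip("/") + quote(hdfs_path) + "?" + query
    return method, url


method, url = webhdfs_request(
    "http://quickstart.cloudera:50070/webhdfs/v1",
    "/user/demo/input",  # hypothetical HDFS directory
    "MKDIRS",
    permission="755",
)
print(method, url)
```

The same pattern applies to the other operations: only the op value, the HTTP method, and the extra query parameters change.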

In the HDFS Metadata, click on the root node and specify the default API you want to use when working with HDFS.

You can override it later in the template options if needed; otherwise, the templates and tools use this default API.

Then define the API properties according to your server configuration:

HDFSServerProperties

Here are the available properties, with examples:

Property	Description	Applies to API	Examples
Name	Label of the Hadoop server
Default API	Default Hadoop API used to operate on the server		Java, Web HDFS, NFS Export, Command Line, Command Line Over SSH
Java Hdfs URL	Base URL used by the Java API to perform the operations.	Java	hdfs://sandbox.hortonworks.com hdfs://quickstart.cloudera maprfs:///mapr.sandboc
Hadoop Configuration Files	Hadoop stores information about the service properties in configuration files such as core-site.xml and hdfs-site.xml. These are XML files containing a list of properties and information about the Hadoop server. Depending on the environment, network, and distribution, these files might be required for the Java API to be able to contact and operate on HDFS. You can therefore specify them in the Metadata to avoid network and connection issues, for instance. To do so, specify them as a comma-separated list of paths pointing to their location. They must be reachable by the Runtime.	Java	D:/hadoop/hdfs/core-site.xml,D:/hadoop/hdfs/hdfs-site.xml
Httpfs URL	HTTP URL used by the WebHDFS API	Web HDFS	http://<hostname>:<port>/webhdfs/v1 http://quickstart.cloudera:50070/webhdfs/v1
Webhdfs URL	WEBHDFS FileSystem URI used by WebHDFS API	Web HDFS	webhdfs://<hostname>:<port> webhdfs://quickstart.cloudera:50070
Hadoop Home	Root directory where the HDFS command line tools can be found. This should be the directory just before the "bin" folder. This is used by Stambia, for the Command Line API, to calculate the path of the hdfs command to execute.	Command Line	/usr/
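
To make the Hadoop Configuration Files property more concrete, here is a minimal sketch of how a comma-separated list of paths can be split, and how the properties of a Hadoop configuration file can be read. It is not Stambia's code. The function names are hypothetical, and the sample XML content (including the port 8020) is an illustrative assumption rather than a value taken from a real cluster.

```python
import xml.etree.ElementTree as ET


def split_config_paths(value):
    """Split the comma-separated property value into individual paths."""
    return [p.strip() for p in value.split(",") if p.strip()]


def read_hadoop_properties(xml_text):
    """Return the name -> value pairs found in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


paths = split_config_paths(
    "D:/hadoop/hdfs/core-site.xml, D:/hadoop/hdfs/hdfs-site.xml"
)

# Minimal core-site.xml content; the value is illustrative only.
sample_core_site = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://quickstart.cloudera:8020</value>
  </property>
</configuration>"""

props = read_hadoop_properties(sample_core_site)
print(paths)
print(props["fs.defaultFS"])
```

Reading the files this way is also a quick check that the paths given in the Metadata point to well-formed XML files that the Runtime can reach.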

If you are using the Command Line Over SSH API, you must drag and drop an SSH Metadata Link containing the SSH connection information into the HDFS Metadata.

Rename it to 'SSH'.

HDFSCommandlineSSH
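
As an illustration of how the Hadoop Home property and the SSH option fit together, the sketch below derives the path of the hdfs command from a Hadoop Home value and optionally wraps the command for execution through the ssh client. The exact commands Stambia generates are not documented here, so treat this as an assumption. The function names, the user stambia, and the host edge-node.example.com are hypothetical.

```python
import posixpath
import shlex


def hdfs_command(hadoop_home, *args):
    """Build the argv of an hdfs command located under <hadoop_home>/bin."""
    return [posixpath.join(hadoop_home, "bin", "hdfs"), *args]


def over_ssh(user, host, argv):
    """Wrap a command so that it runs on a remote host through the ssh client."""
    remote_command = " ".join(shlex.quote(arg) for arg in argv)
    return ["ssh", f"{user}@{host}", remote_command]


# Hadoop Home is the directory just before the "bin" folder.
local = hdfs_command("/usr/", "dfs", "-mkdir", "-p", "/user/demo/input")

# Hypothetical SSH user and host.
remote = over_ssh("stambia", "edge-node.example.com", local)

print(local)
print(remote)
```

With a Hadoop Home of /usr/, the command resolves to /usr/bin/hdfs, which is why the property must point to the directory just before the "bin" folder.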

Configuration of the Kerberos Security

Connections to Kerberos-secured Hadoop clusters are protected, so you need to specify in Stambia the credentials and other information required to establish the Kerberos connection.

Java, Command Line, and Command Line Over SSH APIs

A Kerberos Metadata is available to specify everything required for using Kerberos with these APIs:

Create a new Kerberos Metadata (or use an existing one)
Inside it, define the Kerberos Principal to use for HDFS
Drag and drop it into the HDFS Metadata
Rename the Metadata Link to 'KERBEROS'

HDFSKerberos

Notes:

The 'Command Line' API will use the 'Kerberos Local Keytab File Path' property of the Kerberos Metadata

The 'Command Line Over SSH' API will use the 'Kerberos Remote Keytab File Path' property of the Kerberos Metadata

Refer to this dedicated article for further information about the Kerberos Metadata configuration
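
The keytab rule described in the notes above can be sketched as follows. This is only an illustration: the dictionary stands in for the Kerberos Metadata, the principal and keytab paths are hypothetical, and the kinit call shows how a keytab is typically used with the standard Kerberos client, not necessarily what Stambia executes.

```python
def keytab_for_api(api, kerberos):
    """Pick the keytab property that applies to the given API."""
    if api == "Command Line":
        return kerberos["Kerberos Local Keytab File Path"]
    if api == "Command Line Over SSH":
        return kerberos["Kerberos Remote Keytab File Path"]
    raise ValueError(f"No keytab rule described here for API: {api}")


def kinit_command(principal, keytab):
    """Build a kinit call that authenticates a principal from a keytab file."""
    return ["kinit", "-kt", keytab, principal]


# Hypothetical values standing in for the Kerberos Metadata.
kerberos_metadata = {
    "Kerberos Principal": "hdfs-user@EXAMPLE.COM",
    "Kerberos Local Keytab File Path": "/opt/stambia/keytabs/hdfs-user.keytab",
    "Kerberos Remote Keytab File Path": "/home/stambia/hdfs-user.keytab",
}

keytab = keytab_for_api("Command Line Over SSH", kerberos_metadata)
command = kinit_command(kerberos_metadata["Kerberos Principal"], keytab)
print(command)
```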

Other APIs

Web HDFS

Kerberos is not currently supported with the Web HDFS API. Please use another API if your cluster is secured with Kerberos.

NFS Export

Stambia performs the operations directly on the NFS Gateway, so there is nothing to configure on the Stambia side for Kerberos security when using this API.

It is the NFS Gateway itself that must be configured to use Kerberos.

Please refer to the Apache documentation for more information on how to do this.

Definition of the HDFS folders

Once the server and API properties are configured, you can create in your Metadata the HDFS Folder nodes on which you plan to work.

Right click on the root node, and choose New > Folder

HDFSNewFolder

Then, specify the HDFS path of the folder.

HDFSFolder

Performing HDFS operations

Once the Metadata is configured, you can start performing operations on HDFS.

A set of TOOLS, each dedicated to one operation, is available for this under the Hadoop Templates.

templates.hadoop/hdfs

The following tools are available:

Name	Description
TOOL HDFS File Mkdir	Create an HDFS directory
TOOL HDFS File Put	Send a file to HDFS
TOOL HDFS File Mv	Move a file or folder between HDFS locations
TOOL HDFS File Get	Retrieve a file from HDFS locally
TOOL HDFS File Set Properties	Set properties on a file or directory, such as permissions, owner, group, or replication
TOOL HDFS File Delete	Delete a file or folder from HDFS
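
For readers familiar with the Hadoop command line, the sketch below maps each tool to the hdfs dfs subcommands that perform a comparable operation. This is a rough equivalence for orientation only: the commands actually run by the tools depend on the chosen API and are not specified here. The mapping name, the helper function, and the file paths are hypothetical.

```python
# Rough hdfs dfs equivalents of each tool (an assumption, for orientation only).
HDFS_SHELL_EQUIVALENTS = {
    "TOOL HDFS File Mkdir": ["-mkdir"],
    "TOOL HDFS File Put": ["-put"],
    "TOOL HDFS File Mv": ["-mv"],
    "TOOL HDFS File Get": ["-get"],
    # Permissions, owner, group, and replication are separate subcommands.
    "TOOL HDFS File Set Properties": ["-chmod", "-chown", "-chgrp", "-setrep"],
    "TOOL HDFS File Delete": ["-rm"],
}


def shell_equivalent(tool, *args, variant=0):
    """Return the argv of a comparable hdfs dfs command for the given tool."""
    subcommand = HDFS_SHELL_EQUIVALENTS[tool][variant]
    return ["hdfs", "dfs", subcommand, *args]


# Hypothetical local file and HDFS directory.
put = shell_equivalent("TOOL HDFS File Put", "/tmp/data.csv", "/user/demo/input")
chown = shell_equivalent(
    "TOOL HDFS File Set Properties", "demo", "/user/demo/input", variant=1
)
print(put)
print(chown)
```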

Using the tools

To use the tools presented earlier, follow these steps:

Drag and drop the tool into a Process
Drag and drop the HDFS Metadata Link into the Process or directly onto the tool
Set the properties according to your needs

Example:

HDFSToolPut

In this example, we are using the tool dedicated to sending files to HDFS.

To do this, we dragged and dropped the tool onto the Process (1), added our source file Metadata node (2) and our target HDFS directory (2), and filled in the parameters (3).

Here we use XPath expressions to automatically retrieve the path information from the Metadata Links.

Note: For further information, please consult the tool's Process and parameters description.

Demonstration Project

The Hadoop demonstration project that you can find on the download page contains examples for most of the HDFS Tools.

Do not hesitate to have a look at this project for samples showing how to use them.