

The Hadoop Distributed File System (HDFS) is the core of the Apache Hadoop framework.

This file system is used to store and process data through frameworks such as MapReduce or YARN.

You can manipulate it in Stambia through the HDFS Metadata and Tools, which are presented in this article.

Prerequisites:

You must install the Hadoop connector to be able to work with HDFS.

Please refer to the following article, which will guide you through the installation.

 

Metadata

The first step, when you want to work with HDFS in Stambia DI, consists of creating and configuring the HDFS Metadata.

Here is a summary of the main steps to follow:

  1. Creation of the HDFS Metadata
  2. Configuration of the server properties
  3. (Optional) Configuration of the Kerberos security
  4. Definition of the HDFS folders

 

And here is an example of a common Metadata configuration:

HDFSMetadataOverview

 

Metadata creation

First, create the HDFS Metadata, as usual, by selecting the technology in the Metadata Creation Wizard:

HDFSNewMetadata

Click Next, choose a name, and click Finish.

 

Configuration of the server properties

The connector can use several APIs to perform the operations, leaving the choice of the preferred method to the user.

This impacts the way Stambia connects to HDFS and performs operations on it.

You do not have to specify all of the API properties, only the ones for the API you plan to use.

The following APIs are available:

Java
The Java APIs provided by Apache and/or the Hadoop distribution are used to perform the operations. Stambia uses the libraries and utilities installed with the connector, which means the Hadoop HDFS libraries must be installed in Stambia to use this API (see the installation article).
Kerberos security support: Yes

Web HDFS
The Web HDFS API uses RESTful web services to perform the operations. Stambia invokes the REST API corresponding to each operation (a sketch of such a call is shown after this list).
Kerberos security support: Not currently supported

NFS Export
Stambia performs the operations directly through an NFS Gateway. The NFS Gateway must be installed on the system where the Runtime is located. Please refer to the Apache documentation for further information about how to install an NFS Gateway for HDFS on the file system.
Kerberos security support: Yes

Command Line [Over SSH]
The Hadoop command line tools are used to perform the operations. Stambia executes the command corresponding to each operation, either on the local system or on a remote server through SSH, depending on the chosen option.
Kerberos security support: Coming soon
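
To make the Web HDFS option more concrete, here is a minimal sketch, in plain Java, of the kind of REST calls involved (a directory listing and a directory creation). The host, port, and paths are only examples; Stambia builds and sends the equivalent requests itself, so this is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsSketch {
    public static void main(String[] args) throws Exception {
        // Example host and port; adjust to your cluster.
        String base = "http://quickstart.cloudera:50070/webhdfs/v1";

        // LISTSTATUS: list the content of an HDFS directory (GET, JSON response).
        HttpURLConnection list = (HttpURLConnection)
                new URL(base + "/tmp?op=LISTSTATUS").openConnection();
        list.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(list.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // FileStatuses JSON
            }
        }
        list.disconnect();

        // MKDIRS: create a directory (PUT, no request body).
        HttpURLConnection mkdir = (HttpURLConnection)
                new URL(base + "/tmp/stambia_demo?op=MKDIRS").openConnection();
        mkdir.setRequestMethod("PUT");
        System.out.println("MKDIRS HTTP status: " + mkdir.getResponseCode());
        mkdir.disconnect();
    }
}
```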

 

In the HDFS Metadata, click on the root node and specify the default API you want to use when working with HDFS.

You'll be able to change it later in the template options if needed, but this is the default the templates and tools will use when no API is specified.

Then define the API properties according to your server configuration:

HDFSServerProperties

Here are the available properties, with examples:

Name
Label of the Hadoop server.

Default API
Default Hadoop API used to operate on the server:

  • Java
  • Web HDFS
  • NFS Export
  • Command Line
  • Command Line Over SSH

Java Hdfs URL (Java API)
Base URL used by the Java API to perform the operations.
Examples: hdfs://sandbox.hortonworks.com, hdfs://quickstart.cloudera, maprfs:///mapr.sandboc

Hadoop Configuration Files (Java API)
Hadoop stores information about the services' properties in configuration files such as core-site.xml and hdfs-site.xml. These XML files contain a list of properties and information about the Hadoop server. Depending on the environment, network, and distribution, these files might be required for the Java API to be able to contact and operate on HDFS. You can therefore specify them in the Metadata to avoid network and connection issues, for instance. To do so, simply provide a comma-separated list of paths pointing to their location; they must be reachable by the Runtime. (See the sketch after this list for how these properties are typically used.)
Example: D:/hadoop/hdfs/core-site.xml,D:/hadoop/hdfs/hdfs-site.xml

Httpfs URL (Web HDFS API)
HTTP URL used by the Web HDFS API.
Examples: http://<hostname>:<port>/webhdfs/v1, http://quickstart.cloudera:50070/webhdfs/v1

Webhdfs URL (Web HDFS API)
WebHDFS FileSystem URI used by the Web HDFS API.
Examples: webhdfs://<hostname>:<port>, webhdfs://quickstart.cloudera:50070

Hadoop Home (Command Line API)
Root directory where the HDFS command line tools can be found; this should be the directory containing the "bin" folder. Stambia uses it, for the Command Line API, to compute the path of the hdfs command to execute.
Example: /usr/
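
As an illustration of how the 'Java Hdfs URL' and 'Hadoop Configuration Files' properties are typically consumed, here is a minimal sketch based on the standard Hadoop Java client. The URL and file paths reuse the examples above and should be adapted to your environment; this is not Stambia's internal code.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsJavaApiSketch {
    public static void main(String[] args) throws Exception {
        // Load the Hadoop configuration files declared in the Metadata
        // ("Hadoop Configuration Files" property, example paths from above).
        Configuration conf = new Configuration();
        conf.addResource(new Path("D:/hadoop/hdfs/core-site.xml"));
        conf.addResource(new Path("D:/hadoop/hdfs/hdfs-site.xml"));

        // Connect using the base URL from the "Java Hdfs URL" property.
        FileSystem fs = FileSystem.get(URI.create("hdfs://quickstart.cloudera"), conf);

        // Quick connectivity check: list the root of the file system.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```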

 

If you are using the Command Line Over SSH API, you must drag and drop an SSH Metadata Link containing the SSH connection information into the HDFS Metadata.

Rename it to 'SSH'.

HDFSCommandlineSSH
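
For reference, here is a rough sketch of what the local Command Line variant amounts to: running the hdfs executable found under the Hadoop Home. The Hadoop Home value and the listed path are hypothetical, and this only illustrates the mechanism, not Stambia's implementation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsCommandLineSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Hadoop Home: the hdfs executable is resolved from its "bin" folder.
        String hadoopHome = "/usr/";

        // List the HDFS root with the standard command line client: hdfs dfs -ls /
        ProcessBuilder pb = new ProcessBuilder(hadoopHome + "bin/hdfs", "dfs", "-ls", "/");
        pb.redirectErrorStream(true);
        Process process = pb.start();

        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.out.println("Exit code: " + process.waitFor());
    }
}
```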

 

Configuration of the Kerberos Security

When working with Kerberos-secured Hadoop clusters, connections are protected, and you therefore need to specify in Stambia the credentials and other information required to perform the Kerberos connection.

Java, Command Line, and Command Line Over SSH APIs

A Kerberos Metadata is available to specify everything required for using Kerberos with these APIs:

  1. Create a new Kerberos Metadata (or use an existing one)
  2. Define in it the Kerberos Principal to use for HDFS
  3. Drag and drop it in the HDFS Metadata
  4. Rename the Metadata Link to 'KERBEROS'

 

HDFSKerberos

Notes:

  • The 'Command Line' API will use the 'Kerberos Local Keytab File Path' property of the Kerberos Metadata
  • The 'Command Line Over SSH API' will use the 'Kerberos Remote Keytab File Path' property of the Kerberos Metadata
  • Refer to this dedicated article for further information about the Kerberos Metadata configuration
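
For reference, here is a minimal sketch of a keytab-based Kerberos login with the Hadoop Java client. The principal and keytab path are hypothetical placeholders corresponding to the properties of your Kerberos Metadata; this only illustrates the mechanism and is not Stambia's internal code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        // Tell the Hadoop client that the cluster is secured with Kerberos.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Hypothetical principal and keytab path: use the values defined
        // in your Kerberos Metadata.
        UserGroupInformation.loginUserFromKeytab(
                "hdfs-user@EXAMPLE.COM",
                "/etc/security/keytabs/hdfs-user.keytab");

        System.out.println("Logged in as: "
                + UserGroupInformation.getLoginUser().getUserName());
    }
}
```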

 

 

Other APIs

Web HDFS

Kerberos is not currently supported with the Web HDFS API; please use another API if your cluster is secured with Kerberos.

 

NFS Export

Stambia performs the operations directly on the NFS Gateway, so there is nothing to be done on the Stambia side for Kerberos security when using this API.

It is the NFS Gateway itself that must be configured to use Kerberos.

Please refer to the Apache documentation for more information on how to do this.

 

Definition of the HDFS folders

With the server and API properties configured, you can now create in your Metadata the HDFS Folder nodes you plan to work on.

Right-click the root node and choose New > Folder.

HDFSNewFolder

Then, specify the HDFS path of the folder.

HDFSFolder

 

Performing HDFS operations

Once the Metadata is configured, you can start performing operations on HDFS.

For this, you have at your disposal a set of TOOLS dedicated to each operation, which you can find under the Hadoop Templates.

templates.hadoop/hdfs

 

The following tools are available (a sketch of the equivalent Hadoop Java API calls is shown after the list):

  • TOOL HDFS File Mkdir: Create an HDFS directory
  • TOOL HDFS File Put: Send a file to HDFS
  • TOOL HDFS File Mv: Move a file or folder within HDFS
  • TOOL HDFS File Get: Retrieve a file from HDFS locally
  • TOOL HDFS File Set Properties: Set properties on a file or directory, such as permissions, owner, group, or replication
  • TOOL HDFS File Delete: Delete a file or folder from HDFS
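
To give an idea of what these operations correspond to at the file-system level, here is a hedged sketch using the Hadoop Java client. The cluster URL and paths are hypothetical; the tools handle all of this for you through the Metadata and their parameters.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToolsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster URL and paths, for illustration only.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://quickstart.cloudera"), new Configuration());
        Path dir = new Path("/tmp/stambia_demo");

        fs.mkdirs(dir);                                            // HDFS File Mkdir
        fs.copyFromLocalFile(new Path("/tmp/local.csv"),           // HDFS File Put
                new Path(dir, "data.csv"));
        fs.rename(new Path(dir, "data.csv"),                       // HDFS File Mv
                new Path(dir, "data_renamed.csv"));
        fs.copyToLocalFile(new Path(dir, "data_renamed.csv"),      // HDFS File Get
                new Path("/tmp/downloaded.csv"));
        fs.setReplication(new Path(dir, "data_renamed.csv"),       // HDFS File Set Properties
                (short) 2);
        fs.delete(dir, true);                                      // HDFS File Delete (recursive)

        fs.close();
    }
}
```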

 

Using the tools

To use the tools presented earlier, follow these steps:

  1. Drag and drop the tool into a Process
  2. Drag and drop the HDFS Metadata Link into the Process or directly onto the tool
  3. Set the properties according to your needs

 

Example:

HDFSToolPut

 

In this example, we are using the tool dedicated to sending files to HDFS.

For this, we dragged and dropped the tool onto the Process (1), our source file Metadata node (2), and our target HDFS directory (2), then filled in the parameters (3).

We use XPath expressions here to retrieve the path information automatically from the Metadata Links.

Note: For further information, please consult the tool's Process and parameters description.

 

Demonstration Project

The Hadoop demonstration project, which you can find on the download page, contains examples for most of the HDFS Tools.

Do not hesitate to have a look at this project to find samples and examples of how to use them.

 

 
