The Hadoop Distributed File System (HDFS) is the core of the Apache Hadoop framework.
This particular file system is used to store and process data through systems like MapReduce or YARN.
You can manipulate it in Stambia through the HDFS Metadata and Tools, that are presented in this article.
Prerequisites:You must install the Hadoop connector to be able to work with HDFS.
Please refer to the following article that will guide you to accomplish this.
Metadata
The first step, when you want to work with HDFS in Stambia DI, consists of creating and configuring the HDFS Metadata.
Here is a summary of the main steps to follow:
- Creation of the HDFS Metadata
- Configuration of the server properties
- (Optional) Configuration of the Kerberos security
- Definition of the HDFS folders
And here is an example of a common Metadata Configuration
Metadata creation
Create first the HDFS Metadata, as usual, by selecting the technology in the Metadata Creation Wizard:
Click next, choose a name and click on finish.
Configuration of the server properties
The connector offers the possibility to use several APIs to perform the operations, leaving the choice of the preferred method to the user.
This will impact on the way Stambia will connect and perform operations on HDFS.
Depending on the API you are planning to use, you don't have to specify all of the API properties, but only the ones for the API you chose.
The following APIs are available:
API | Description | Kerberos security support |
Java |
The Java APIs provided by Apache and / or the Hadoop distribution are used to perform the operations. Stambia will use the libraries and utilities installed with the connector, meaning that the Hadoop HDFS libraries must be installed in Stambia to use this. (See installation article.) |
Yes |
Web HDFS |
The Web HDFS APIs consists of using RESTful web APIs to perform the operations. Stambia will invoke the REST APIs corresponding to the operations. |
Not currently supported |
NFS Export |
Stambia performs the operations directly through a NFS Gateway. The NFS Gateway must be installed on the system where the Runtime is located. Please refer to the Apache documentation for further information about how to install a NFS Gateway for HDFS on the file system. |
Yes |
Command Line [Over SSH] |
The Hadoop command line APIs are used to perform the operations. Stambia will execute the commands corresponding to the operations, in the local system or in a remote server through SSH, depending on the chosen option. |
Coming soon |
In the HDFS Metadata, click on the root node and specify the default API you want to use when working with HDFS.
You'll be able to change it, if needed, in the templates options later, but this is the default one the templates and tools should use if not specified.
Then define the properties of the APIs accordingly to your server configuration:
Here are the available properties, with examples:
Property | Description | Apply for API | Examples |
Name | Label of the Hadoop server | ||
Default API |
Default Hadoop API used to operate on the server
|
||
Java Hdfs URL |
Base URL used by the Java API to perform the operations. |
Java |
hdfs://sandbox.hortonworks.com hdfs://quickstart.cloudera maprfs:///mapr.sandboc |
Hadoop Configuration Files |
Hadoop stores information about the services properties in configurations file such as core-site.xml and hdfs-site.xml. These files are XML files containing a list of properties and information about the Hadoop server. Depending on the environment, network, and distributions, these files might be required for the Java API to be able to contact and operate on HDFS. There is therefore the possibility to specify these files in the Metadata to avoid network and connection issues, for instance. For this simply specifies them with a comma separated list of paths pointing to their location. They must be reachable by the Runtime. |
Java | D:/hadoop/hdfs/core-site.xml,D:/hadoop/hdfs/hdfs-site.xml |
Httpfs URL | Web HDFS |
http://<hostname>:<port>/webhdfs/v1 http://quickstart.cloudera:50070/webhdfs/v1 |
|
Webhdfs URL |
WEBHDFS FileSystem URI used by WebHDFS API |
Web HDFS |
webhdfs://<hostname>:<port> webhdfs://quickstart.cloudera:50070 |
Hadoop Home |
Root directory where the HDFS command line tools can be found. This should be the directory just before the "bin" folder. This is used by Stambia, for the Command Line API, to calculate the path of the hdfs command to execute. |
Command Line | /usr/ |
If you are using the Command Line Over SSH API, you must drag and drop a SSH Metadata Link containing the SSH connection information in the HDFS Metadata.
Rename it to 'SSH'.
Configuration of the Kerberos Security
When working with Kerberos secured Hadoop clusters, connections will be protected, and you'll therefore need to specify in Stambia the credentials and necessary information to perform the Kerberos connection.
Java, Command Line, and Command Line Over SSH APIs
A Kerberos Metadata is available to specify everything required for using Kerberos with these APIs:
- Create a new Kerberos Metadata (or use an existing one)
- Define inside the Kerberos Principal to use for HDFS
- Drag and drop it in the HDFS Metadata
- Rename the Metadata Link to 'KERBEROS'
Notes:
- The 'Command Line' API will use the 'Kerberos Local Keytab File Path' property of the Kerberos Metadata
- The 'Command Line Over SSH API' will use the 'Kerberos Remote Keytab File Path' property of the Kerberos Metadata
- Refer to this dedicated article for further information about the Kerberos Metadata configuration
Other APIs
Web HDFS
Kerberos is not currently supported with the Web HDFS API, please use another API if your cluster is secured with Kerberos.
NFS Export
Stambia will perform the operations directly on the NFS Gateway, so there is nothing to be done Stambia side for the Kerberos security when using this API.
This is the NFS Gateway that must be configured to use Kerberos.
Please refer to the Apache documentation for more information on how to do this.
Definition of the HDFS folders
The server and API properties being configured, you can now create in your Metadata the HDFS Folder nodes on which you are planning to work.
Right click on the root node, and choose New > Folder
Then, specify the HDFS path of the folder.
Performing HDFS operations
Once the Metadata is configured, you can now start making operations on HDFS.
You have for this at your disposal a list of TOOLS dedicated to each operation, which you can find under the Hadoop Templates.
templates.hadoop/hdfs
The following tools are available:
Name | Description |
TOOL HDFS File Mkdir | Create an HDFS directory |
TOOL HDFS File Put | Send a file to HDFS |
TOOL HDFS File Mv | Move a file or folder between |
TOOL HDFS File Get | Retrieve a file from HDFS locally |
TOOL HDFS File Set Properties | Set properties on a file or directory, such as permissions, owner, group, or replication |
TOOL HDFS File Delete | Delete a file or folder from HDFS |
Using the tools
To use the tools presented earlier, follow these steps:
- Drag and drop the tool in a Process
- Drag and drop HDFS Metadata Link in the Process or directly on the tool
- Set the properties accordingly to your needs
Example:
In this example we are using the tool dedicated to send files to HDFS.
For this we drag and dropped the tool on the process(1), our source file Metadata node (2), our target HDFS directory (2), and filled the parameters (3).
We are using here XPATH expressions to retrieve automatically the paths information from the Metadata Links.
Note: For further information, please consult the tool's Process and parameters description.
Demonstration Project
The Hadoop demonstration project that you can find on the download page contains examples for most of the HDFS Tools.
Do not hesitate to have a look at this project to find samples and examples on how to use them.