Data movement on Hadoop

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I move data in and out of the Hadoop cluster?

Objectives
  • Know how to update .bashrc with the relevant modules.

  • Know how to request nodes and launch the Hadoop cluster.

  • Know the distribution of HDFS components on the nodes.

  • Know how to copy data into and get data out of HDFS.

Updating .bashrc

  • Run the following command to append the required module loads to your .bashrc:
$ echo "module load openjdk/1.8.0_222-b10-gcc/8.3.1 hadoop/3.2.1-gcc/8.3.1" >> ~/.bashrc

Requesting resources

$ qsub -I -l select=3:ncpus=8:mem=14gb:interconnect=1g,walltime=03:30:00
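
Once the interactive job starts, you can list the nodes allocated to it; PBS records them in the file named by $PBS_NODEFILE:
$ cat $PBS_NODEFILE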

Copying the myhadoop template from /zfs/citi

  • After the request is granted
$ cp -R /zfs/citi/myhadoop/ ~/
$ cd ~/myhadoop

Examining the myhadoop template

$ ls -l
$ ls -l bin/
  • init_hadoop.sh: format and launch a new Hadoop cluster on the allocated resources.
  • test_hadoop.sh: quickly test the newly launched cluster.
  • stop_hadoop.sh: stop the Hadoop cluster and clean up all data storage.
  • bin/myhadoop.sh: launch all components of Hadoop.
  • bin/myhadoop_shutdown.sh: stop all components of Hadoop.
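
These are plain shell scripts, so you can inspect what each one does before running it, for example:
$ cat init_hadoop.sh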

Launching myhadoop

$ ./init_hadoop.sh

The final command in init_hadoop.sh shows the results of a system check. A successful launch reports the number of live data nodes as one less than the total number of nodes requested from Palmetto, since one node is reserved for the name node while the remaining nodes run the data nodes.
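
The same check can be repeated at any time with the dfsadmin report subcommand, which lists the live data nodes and their storage usage (assuming HADOOP_CONF_DIR is set as described below):
$ hdfs dfsadmin -report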

Testing myhadoop

$ ./test_hadoop.sh

A successful test will show the completed run of the test WordCount program.

Hadoop main commands

Users interact with Hadoop via commands and subcommands. The primary command is hdfs; the subcommand for file system operations is dfs. Entering these commands without parameters prints their usage.

$ hdfs
$ hdfs dfs
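
The -help option prints the usage of an individual file system operation, for example:
$ hdfs dfs -help put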

Specifying configuration location

We need to specify the location of the configuration files for our Hadoop cluster. This is done by setting the HADOOP_CONF_DIR environment variable.

$ export HADOOP_CONF_DIR="/home/$USER/myhadoop/config/"
$ hdfs dfs -mkdir /user/
$ hdfs dfs -mkdir /user/$USER
$ hdfs dfs -ls /user/
$ hdfs dfs -ls /user/$USER
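
Note that -mkdir fails if the parent directory does not exist. Like the Linux mkdir -p, the -p flag creates missing parents, so the two -mkdir commands above can be combined:
$ hdfs dfs -mkdir -p /user/$USER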

Challenge: creating a directory

Create a directory named intro-to-hadoop inside your user directory on HDFS. Confirm that the directory was successfully created.

Solution:

$ hdfs dfs -mkdir /user/$USER/intro-to-hadoop
$ hdfs dfs -ls /user/$USER

Home directory on HDFS

In HDFS, the home directory defaults to /user/$USER, where $USER is your username.

$ hdfs dfs -ls /user/$USER
$ hdfs dfs -ls 
$ hdfs dfs -ls .
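
Relative paths are resolved against this home directory, so the following two commands list the same location:
$ hdfs dfs -ls intro-to-hadoop
$ hdfs dfs -ls /user/$USER/intro-to-hadoop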

Uploading and downloading files

To upload data into HDFS, we use the subsubcommand put. To download data from HDFS, we use the subsubcommand get.

$ hdfs dfs -put /zfs/citi/complete-shakespeare.txt intro-to-hadoop/
$ hdfs dfs -ls intro-to-hadoop
$ hdfs dfs -head intro-to-hadoop/complete-shakespeare.txt
$ hdfs dfs -get intro-to-hadoop/complete-shakespeare.txt ~/shakespeare-complete.txt
$ head ~/shakespeare-complete.txt
$ diff /zfs/citi/complete-shakespeare.txt ~/shakespeare-complete.txt
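
put and get also have the longer synonyms copyFromLocal and copyToLocal. For example, the upload above could equivalently be written as follows (-f overwrites the copy that already exists in HDFS):
$ hdfs dfs -copyFromLocal -f /zfs/citi/complete-shakespeare.txt intro-to-hadoop/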

Uploading and downloading directories

The put and get subsubcommands can move entire directories as well as individual files.

$ hdfs dfs -put /zfs/citi/movielens intro-to-hadoop/
$ hdfs dfs -ls intro-to-hadoop
$ hdfs dfs -ls intro-to-hadoop/movielens
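
To see how much space the uploaded data occupies in HDFS, use -du with the -h flag for human-readable sizes:
$ hdfs dfs -du -h intro-to-hadoop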

Checking the health status of files and directories in HDFS

$ hdfs fsck intro-to-hadoop/ -files -blocks -locations
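
For each file, the report lists its blocks, their replication factor, and the data nodes holding each block replica. To check the entire file system rather than a single directory, point fsck at the root:
$ hdfs fsck /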

Key Points

  • HDFS provides an abstraction of a file system. Terminal commands are needed to move data into and out of HDFS.