Introduction to Hadoop

Interacting with Hadoop

Learning objectives

  • Learn how to access the web-based Jupyter notebook.
  • Learn how to use the Hadoop command in Jupyter shells.
  • Learn how to access the web UI of the Hadoop Distributed File System.

In this workshop, we will leverage the Jupyter infrastructure at Clemson University to directly interact with Hadoop.


To start using the Jupyter notebook, go to and sign in with your Clemson credentials. Next, click Start My Server to spawn a new Jupyter notebook. You should see the content of your home directory on Palmetto under Files.
Under New, create a new folder. This folder will appear immediately in your home directory with the name Untitled Folder. Check the selection box next to this folder, and a button called Rename will appear below the Files tab. Click this button to change the folder's name to one of your choice, then click on the folder to open it.
Create New Folder
Use the menu under New once again to create a new Jupyter notebook using Python 3.0 distributed through Anaconda 2.5.0 by Continuum.
Change the name of this notebook to “Introduction to Hadoop”.
For this workshop, code inside a cell is interpreted as Python by default. However, any line that begins with ! is interpreted as a Linux system command.

print("Hello World")
Hello World
!ls -l /
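
As a rough illustration of what the ! prefix does, the sketch below runs the same kind of command line from plain Python using the standard subprocess module. The helper name run_shell is hypothetical and is not part of Jupyter; the notebook's own mechanism may differ in detail.

```python
import subprocess

def run_shell(command):
    """Run a command line through the system shell and return its output as text."""
    result = subprocess.run(
        command,
        shell=True,                  # let the shell parse the line, as with a ! line
        stdout=subprocess.PIPE,      # capture standard output
        stderr=subprocess.STDOUT,    # merge error messages into the same text
        universal_newlines=True,     # decode bytes to str
    )
    return result.stdout

# Roughly what the cell line "!ls -l /" displays
print(run_shell("ls -l /"))
```

In day-to-day notebook use the ! form is shorter; a helper like this is mainly useful when the output needs to be processed further in Python.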

HDFS commands

HDFS provides a set of commands for users to interact with the system from a Linux-based terminal. To view all available HDFS commands, run the following in a cell:

!ssh dsciu001 hdfs
Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  classpath            prints the classpath
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  mover                run a utility to move block replicas across
                       storage types
  oiv                  apply the offline fsimage viewer to an fsimage
  oiv_legacy           apply the offline fsimage viewer to an legacy fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the
                       current directory contents with a snapshot
  lsSnapshottableDir   list all snapshottable dirs owned by the current user
                        Use -help to see options
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway
  cacheadmin           configure the HDFS cache
  crypto               configure HDFS encryption zones
  storagepolicies      list/get/set block storage policies
  version              print the version

Most commands print help when invoked w/o parameters.

For this workshop, we are interested in file system commands. Create a new cell and run the following:

!ssh dsciu001 hdfs dfs
Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
    [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] [-v] [-t [<storage type>]] <path> ...]
    [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] <path> ...]
    [-find <path> ... <expression> ...]
    [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] <src> <localdst>]
    [-help [cmd ...]]
    [-ls [-d] [-h] [-R] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] <localsrc> ... <dst>]
    [-renameSnapshot <snapshotDir> <oldName> <newName>]
    [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
    [-setfattr {-n name [-v value] | -x name} <path>]
    [-setrep [-R] [-w] <rep> <path> ...]
    [-stat [format] <path> ...]
    [-tail [-f] <file>]
    [-test -[defsz] <path>]
    [-text [-ignoreCrc] <src> ...]
    [-touchz <path> ...]
    [-truncate [-w] <length> <path> ...]
    [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

We can see that HDFS provides a number of file system commands that are quite similar to their Linux counterparts. For example, -chown and -chmod change the ownership and permissions of HDFS files and directories, -ls lists the content of a directory, -mkdir creates a new directory, -rm removes files and directories, and so on.
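
To make the parallel with Linux concrete, the following sketch builds a few cell lines in the same form as the commands above. The helper hdfs_dfs_line and the demo directory are hypothetical; the host dsciu001 and the /user/your-username path come from this lesson. It assumes the arguments contain no spaces or shell special characters, and it only prints the lines rather than running them.

```python
HDFS_HOST = "dsciu001"  # host used in the ssh commands in this lesson

def hdfs_dfs_line(*args):
    """Return the notebook cell line that runs one 'hdfs dfs' command over ssh."""
    return "!ssh {} hdfs dfs {}".format(HDFS_HOST, " ".join(args))

user_dir = "/user/your-username"   # replace with your own HDFS user directory
demo_dir = user_dir + "/demo"      # hypothetical directory, for illustration only

for line in (
    hdfs_dfs_line("-ls", user_dir),             # like: ls
    hdfs_dfs_line("-mkdir", "-p", demo_dir),    # like: mkdir -p
    hdfs_dfs_line("-chmod", "750", demo_dir),   # like: chmod 750
):
    print(line)
```

Pasting one of the printed lines into a new cell would run the corresponding command on the cluster.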

When a Hadoop cluster is first started, it contains no data. Users usually import data into the cluster from a traditional Linux-based file system using the command option -put. Conversely, to copy data from HDFS back to a Linux-based file system, the command option -get is used.
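
The argument order of the two commands is easy to mix up, so here is a minimal sketch that spells it out, following the usage text shown earlier (-put takes the Linux-side source first, -get takes the HDFS source first). The helpers put_line and get_line and the file names are hypothetical. Because the command runs through ssh, the Linux-side path is resolved on the host where hdfs runs. The sketch assumes paths without spaces and only prints the cell lines.

```python
def put_line(local_src, hdfs_dst, host="dsciu001"):
    """Cell line that copies a file from the Linux file system into HDFS."""
    return "!ssh {} hdfs dfs -put {} {}".format(host, local_src, hdfs_dst)

def get_line(hdfs_src, local_dst, host="dsciu001"):
    """Cell line that copies a file from HDFS back to the Linux file system."""
    return "!ssh {} hdfs dfs -get {} {}".format(host, hdfs_src, local_dst)

# Hypothetical file names, for illustration only
print(put_line("data.txt", "/user/your-username/data.txt"))
print(get_line("/user/your-username/data.txt", "data-copy.txt"))
```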

HDFS Web Interface

At Clemson University, the Hadoop Big Data infrastructure is called the Cypress cluster. It uses an open source flavor of Hadoop distributed by Hortonworks. HDFS provides a web-based user interface for users to view stored data. The interface is hosted on the HDFS NameNode, which is replicated to ensure uninterrupted operation. The URLs of the NameNode replicas are:
This figure shows the interfaces of the two HDFS NameNode replicas. Only the active instance (left) can be used to view files and directories.

Check your understanding: Using Jupyter shell to download data

Create a directory named intro-to-hadoop in your home directory on Palmetto.

From inside this directory, run the following command to get data from GitHub:

!git clone

View this newly cloned directory to confirm that you have the file gutenberg-shakespeare.txt.

Check your understanding: View files and directories on HDFS

View the content of your HDFS user directory (/user/your-username) on Cypress.

Check your understanding: Create directory on HDFS

Create a directory named intro-to-hadoop in your HDFS user directory.

Check your understanding: Import file to HDFS

Copy the file gutenberg-shakespeare.txt from Palmetto to this newly created intro-to-hadoop directory on HDFS using -put. View the content of the intro-to-hadoop directory to confirm that the file has been successfully uploaded.