Introduction to Hadoop

Learning objectives

By the end of this workshop, you will be able to:

  1. Create, manage and navigate files and directories on the Hadoop Distributed File System (HDFS).
  2. Understand the concept of data locality in distributed systems.
  3. Understand how big data files are distributed across HDFS and processed by the MapReduce programming paradigm to take advantage of data locality.
  4. Write Streaming MapReduce Python programs to analyze a large data set.
  5. Integrate Streaming MapReduce processing with standard Python programs to facilitate complex analysis.
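As a preview of objectives 4 and 5, here is a minimal sketch of a word-count mapper and reducer written in the Hadoop Streaming style, where each stage reads key/value records from standard input and writes tab-separated records to standard output. This is an illustrative example, not the scripts used in the lessons:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word -- the tab-separated
    key/value convention used by Hadoop Streaming."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts for each word. Input must be sorted by key,
    which Hadoop's shuffle/sort phase guarantees between stages."""
    def key(record):
        return record.split("\t", 1)[0]
    for word, group in groupby(sorted_records, key=key):
        total = sum(int(r.split("\t", 1)[1]) for r in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    # Under Hadoop Streaming, the mapper and reducer run as separate
    # scripts; here the two stages are chained locally for testing.
    for record in reducer(sorted(mapper(sys.stdin))):
        print(record)
```

On the cluster, the two stages would be submitted as separate mapper and reducer scripts to the Hadoop Streaming utility; locally, piping the mapper output through `sort` and into the reducer simulates the same data flow.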


This workshop requires prerequisite knowledge equivalent to the following COE workshops:

  1. Introduction to research computing on the Palmetto Cluster.
  2. Introduction to Linux.
  3. Introduction to Python.


These lessons are modeled after the structure of Data Carpentry lesson materials, an open source project. Like Data Carpentry, we welcome contributions of all kinds: new lessons, fixes/improvements to existing material, corrections to typos, bug reports, and reviews of proposed changes are all equally welcome. Please see our page on Contributing to get started.

Lessons

  1. Introduction to the Hadoop Distributed File System (HDFS)
  2. Interaction with the Hadoop cluster
  3. Files and Directories
  4. MapReduce Programming Paradigm
  5. Running a Streaming MapReduce program
  6. Creating a Mapper
  7. Creating a Reducer
  8. Integrating MapReduce