Instructor

  • Instructor: Linh B. Ngo
  • Email: lngo AT clemson DOT edu

Workshop Description

This workshop introduces participants to the Hadoop ecosystem that can be deployed on Palmetto. We will cover Hadoop’s architecture, how it can be deployed on Palmetto, importing and exporting big data, basic usage, and how to submit scalable data analysis jobs. The workshop uses JupyterHub and Jupyter notebooks. An understanding of the Linux command line and some Python experience are necessary.

Prerequisites

This workshop requires:

Course Outline

Topic | Description
Setup | Preparing for the course
1. Introduction to Hadoop Distributed File System (HDFS) | Why do we need another distributed file system?
2. Data movement on Hadoop | How do I move data in and out of the Hadoop cluster?
3. Programming and debugging Hadoop MapReduce | How do we write programs to leverage HDFS’s data placement?
4. Optimizing Hadoop MapReduce | Do we just throw compute nodes at big data?
5. Running Hadoop as a Batch Job | Do we have to be in interactive mode?
Finish |

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.