Introduction to Hadoop
By the end of this workshop, you will be able to:
- Create, manage and navigate files and directories on the Hadoop Distributed File System (HDFS).
- Understand the concept of data locality in distributed systems.
- Understand how big data files are distributed across HDFS and processed by the MapReduce programming paradigm to take advantage of data locality.
- Write Streaming MapReduce Python programs to analyze a large data set.
- Integrate Streaming MapReduce processing with standard Python programs to facilitate complex analysis.
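As a preview of the Streaming MapReduce objective above, here is a minimal sketch of a word-count job written as a single Python script. The script name (`wordcount.py`) and the map/reduce role argument are illustrative assumptions, not part of the workshop materials; Hadoop Streaming itself simply pipes input lines to the mapper on standard input and expects tab-separated `key\tvalue` pairs on standard output, then feeds the key-sorted pairs to the reducer the same way.

```python
#!/usr/bin/env python3
# Minimal sketch of a Hadoop Streaming word-count job (illustrative,
# not the workshop's official example). Hadoop Streaming pipes input
# lines to the mapper on stdin and expects tab-separated "key\tvalue"
# pairs on stdout; the reducer receives those pairs sorted by key.
import sys


def mapper(lines):
    """Emit 'word\t1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(pairs):
    """Sum counts per word; assumes pairs arrive sorted by key."""
    current, total = None, 0
    for pair in pairs:
        word, count = pair.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"


if __name__ == "__main__":
    # Choose the role via a command-line argument, e.g. (hypothetical):
    #   hadoop jar hadoop-streaming.jar \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" ...
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    stage = mapper if role == "map" else reducer
    for out in stage(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

Because the mapper and reducer only read stdin and write stdout, the same script can be tested locally with ordinary shell pipes (e.g. `cat input.txt | ./wordcount.py map | sort | ./wordcount.py reduce`) before submitting it to the cluster.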
This workshop requires prerequisite knowledge equivalent to the following COE workshops:
- Introduction to research computing on the Palmetto Cluster.
- Introduction to Linux.
- Introduction to Python.
These lessons are modeled after the structure of Data Carpentry lesson materials, an open source project. Like Data Carpentry, we welcome contributions of all kinds: new lessons, fixes/improvements to existing material, corrections to typos, bug reports, and reviews of proposed changes are all equally welcome. Please see our page on Contributing to get started.